Posted to reviews@spark.apache.org by chetkhatri <gi...@git.apache.org> on 2017/12/19 11:37:44 UTC

[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

GitHub user chetkhatri opened a pull request:

    https://github.com/apache/spark/pull/20018

    SPARK-22833 [Improvement] in SparkHive Scala Examples

    ## What changes were proposed in this pull request?
    
    Improvements made to the SparkHive Scala example:
    * Writing a DataFrame / Dataset to a Hive managed table and a Hive external table using different storage formats.
    * Demonstrating partitioning, repartition, and coalesce with appropriate examples (see the sketch below).
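
    A minimal sketch of the repartition / coalesce usage listed above (assumes a SparkSession named `spark` with Hive support, an existing `records` table with a `key` column, and an illustrative output path):

    ```scala
    import org.apache.spark.sql.SaveMode
    import spark.implicits._  // for the $"col" syntax; assumes `spark` is in scope

    val df = spark.table("records")

    // repartition: full shuffle; groups rows by key so each Hive partition
    // directory ends up with fewer, larger files
    df.repartition($"key").write.mode(SaveMode.Overwrite)
      .partitionBy("key").parquet("/tmp/records_by_key")

    // coalesce: caps the number of output partitions (here 10) without a full shuffle
    df.coalesce(10).write.mode(SaveMode.Overwrite)
      .partitionBy("key").parquet("/tmp/records_by_key")
    ```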
    
    ## How was this patch tested?
    * Patch has been tested manually and by running ./dev/run-tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chetkhatri/spark scala-sparkhive-examples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20018
    
----
commit 9d9b42bb49997ce7d308fbf50072e5f5e0eccaa2
Author: chetkhatri <ck...@gmail.com>
Date:   2017-12-19T11:33:47Z

    SPARK-22833 [Improvement] in SparkHive Scala Examples

----


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407717
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    --- End diff --
    
    `parquet` -> `Parquet`.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158370168
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    --- End diff --
    
    @srowen I totally agree with you. I will rephrase the content for the docs. For now, I have removed it from here; please check and do the needful.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Seems this did not pass the tests ... this causes a build failure:
    
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85343/console
    
    ```
    ========================================================================
    Running Scala style checks
    ========================================================================
    Scalastyle checks failed at following occurrences:
    [error] /home/jenkins/workspace/SparkPullRequestBuilder/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala:138:0: Whitespace at end of line
    [error] Total time: 13 s, completed Dec 23, 2017 7:34:15 AM
    [error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received return code 1
    ```
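
    (For reference, the failing check can be reproduced locally by running `dev/lint-scala`, or the full `./dev/run-tests` mentioned in the PR description, before pushing.)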


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407873
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    --- End diff --
    
    `turn` -> `Turn`.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454252
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    --- End diff --
    
    @HyukjinKwon Thanks for pointing this out; fixed.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r157757263
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    Do you not want the code below to render in the docs as part of the example? Maybe not; just checking whether that's intentional.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454275
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    --- End diff --
    
    @HyukjinKwon Thanks for pointing this out; fixed.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210425
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    --- End diff --
    
    actually, I think `spark.table("records")` is a better example.
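
    A minimal sketch of the suggested form, equivalent to `sql("SELECT * FROM records")`:

    ```scala
    val hiveTableDF = spark.table("records")
    ```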


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158366994
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    --- End diff --
    
    @srowen Done, changes addressed


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158368554
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    --- End diff --
    
    @srowen Done; removed `toDF()`. cc @cloud-fan


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    @holdenk @sameeragarwal Please review and do the needful.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407946
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +
    +    // reduce number of files for each partition by repartition
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    +      .partitionBy("key").parquet(hiveExternalTableLocation)
    +
    +    // Control number of files in each partition by coalesce
    --- End diff --
    
    ` Control number of files` -> ` Control the number of files`


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    @srowen Can you please review, and if everything looks correct, trigger a test build?


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158100765
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    Why do you turn the example listing off and then on again? Just remove those two lines.
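
    (For context, the `// $example on:spark_hive$` / `// $example off:spark_hive$` markers delimit the snippet that the docs build extracts; a sketch of the convention, with the tag name taken from the diff above:)

    ```scala
    // $example on:spark_hive$
    // ... code that should appear in the rendered docs ...
    // $example off:spark_hive$
    ```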


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454291
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +
    +    // reduce number of files for each partition by repartition
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    +      .partitionBy("key").parquet(hiveExternalTableLocation)
    +
    +    // Control number of files in each partition by coalesce
    --- End diff --
    
    @HyukjinKwon Thanks for pointing this out; fixed.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Thanks @HyukjinKwon @wangyum 


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210666
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    --- End diff --
    
    It's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table.
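
    For example, a sketch of external-table DDL with an explicit location (the path is illustrative):

    ```scala
    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
      "STORED AS PARQUET LOCATION '/user/hive/warehouse/database_name.db/records'")
    ```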


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158133606
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    --- End diff --
    
    Oh, just noticed this. You're using Javadoc-style comments here, but they won't have any effect.
    Just use the `//` comment style that you see above, for consistency.
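
    For instance, a sketch of the suggested change:

    ```scala
    // Before: Javadoc-style block comment (has no effect on the rendered example)
    /*
     * Save DataFrame to Hive managed table as Parquet format
     */

    // After: the `//` style used elsewhere in the file
    // Save DataFrame to Hive managed table as Parquet format
    ```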


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158133877
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    --- End diff --
    
    This is more stuff that should go in the docs, not in comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? That's the only goal here.

    What are "small-small files"? You also have some inconsistent capitalization; Parquet should be capitalized, but not file, bandwidth, etc.



---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407583
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    --- End diff --
    
    `parquet` -> `Parquet`


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210374
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    --- End diff --
    
    `.toDF` is not needed
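
    That is, `sql(...)` already returns a DataFrame, so the call is redundant:

    ```scala
    val hiveTableDF = sql("SELECT * FROM records")  // already a DataFrame; .toDF() is a no-op here
    ```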


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454218
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    --- End diff --
    
    @HyukjinKwon Thanks for pointing this out; fixed.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210714
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    +    downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
    +    To improve performance you can create single parquet file under each partition directory using 'repartition'
    +    on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
    +    Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
    +     */
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    --- End diff --
    
    This is not standard usage; let's not put it in the example.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    I just opened a quick hotfix - https://github.com/apache/spark/pull/20065 - since I think we don't run the examples in the build and tests, so all we need to fix is the style.

    Reverting also works fine for me, @srowen. I can close mine.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r157942866
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    @srowen Can you please review this? cc @holdenk @sameeragarwal


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158134032
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    +    downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
    +    To improve performance you can create single parquet file under each partition directory using 'repartition'
    +    on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
    +    Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
    +     */
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    +      .partitionBy("key").parquet(hiveExternalTableLocation)
    +
    +    /*
    +     You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
    +     data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
    --- End diff --
    
    The sentences need some cleanup here. What do you mean by an 'Int' argument? Maybe it's best to point people to the API docs rather than incompletely repeating them.
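
    For reference, a sketch of what the comment is describing: `coalesce` takes the target number of partitions and avoids a full shuffle.

    ```scala
    // def coalesce(numPartitions: Int): Dataset[T]
    hiveTableDF.coalesce(10)  // at most 10 partitions, no full shuffle
    ```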


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407620
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    --- End diff --
    
    `Managed` -> `managed`


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407739
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    --- End diff --
    
    `parquet` -> `Parquet`


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158408627
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    --- End diff --
    
    `spark` -> `Spark`


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158368719
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    --- End diff --
    
    @cloud-fan We'll keep all the descriptive comments in the documentation, with user-friendly wording. I have also added the location.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Thank you @wangyum :D.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407910
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +
    +    // reduce number of files for each partition by repartition
    --- End diff --
    
    `reduce` -> `Reduce`.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210754
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    +    downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
    +    To improve performance you can create single parquet file under each partition directory using 'repartition'
    +    on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
    +    Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
    +     */
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    +      .partitionBy("key").parquet(hiveExternalTableLocation)
    +
    +    /*
    +     You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
    +     data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
    +     full data shuffle.
    +     */
    +    // coalesce of 10 could create 10 parquet files under each partitions,
    +    // if data is huge and make sense to do partitioning.
    +    hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
    --- End diff --
    
    ditto
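
    To make the trade-off in the quoted comments concrete, a minimal sketch contrasting the two operations, reusing `hiveTableDF` and `hiveExternalTableLocation` from the diff:

        // repartition performs a full shuffle and spreads rows evenly across
        // partitions; coalesce only merges existing partitions, which avoids
        // the shuffle but can leave the result skewed.
        val evenlySpread = hiveTableDF.repartition(10)
        val merged = hiveTableDF.coalesce(10)

        // Writing the coalesced frame caps the file count per partition
        // directory at 10, since each of the 10 tasks writes at most one
        // file per key value it holds.
        merged.write.mode(SaveMode.Overwrite)
          .partitionBy("key")
          .parquet(hiveExternalTableLocation)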


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r157973588
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    @srowen I have updated the DDL for storing data with partitioning in Hive.
    cc\ @HyukjinKwon @mgaido91 @markgrover @markhamstra 
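
    For readers following along, the DDL change referred to here moves the partition column out of the regular column list; a minimal sketch, assuming `import spark.sql` as in the example (the name `records_partitioned` is hypothetical):

        // Unpartitioned table: every column sits in the column list.
        sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")

        // Partitioned table: the partition column moves into PARTITIONED BY.
        sql("CREATE TABLE records_partitioned(value string) " +
          "PARTITIONED BY (key int) STORED AS PARQUET")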


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Merged to master


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Adding other contributors to this file for review. cc\
    @cloud-fan 
    @aokolnychyi
    @liancheng 
    @HyukjinKwon


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158370509
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    +    downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
    +    To improve performance you can create single parquet file under each partition directory using 'repartition'
    +    on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
    +    Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
    +     */
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    --- End diff --
    
    @cloud-fan removed all the comments; as discussed with @srowen, it makes more sense to have this explanation in the docs, with the inconsistency removed.


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    @HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158113948
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    @srowen I misunderstood your first comment. I have reverted as suggested. Please check now.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158370581
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     * warehouse location will be used to store Hive table Data.
    +     * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     * You don't have to explicitly give location for each table, every tables under specified schema will be located at
    +     * location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
    +     * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
    +     * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +    If Data volume is very huge, then every partitions would have many small-small files which may harm
    +    downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
    +    To improve performance you can create single parquet file under each partition directory using 'repartition'
    +    on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
    +    Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
    +     */
    +    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
    +      .partitionBy("key").parquet(hiveExternalTableLocation)
    +
    +    /*
    +     You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
    +     data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
    --- End diff --
    
    @srowen done.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454265
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    --- End diff --
    
    @HyukjinKwon Thanks for highlighting this; I have improved it.
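
    One hedged side note on this flow: when files are written straight to the external table's location instead of through saveAsTable, Spark may keep a stale cached file listing for the table. A minimal sketch of refreshing it, reusing the names from the diff and assuming the table is named `records`:

        // Write directly into the external table's storage location ...
        hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)

        // ... then invalidate the cached metadata so later queries on the
        // table pick up the newly written files.
        spark.catalog.refreshTable("records")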


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    @chetkhatri no need to keep pinging. We intentionally leave these changes open for review for a day or more to make sure everyone has seen it who wants to.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454282
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    +
    +    // Save DataFrame to Hive External table as compatible parquet format
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +
    +    // reduce number of files for each partition by repartition
    --- End diff --
    
    @HyukjinKwon Thanks for highlighting this; I have improved it.
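
    The `hive.exec.dynamic.partition*` flags in the quoted diff also enable dynamic-partition inserts through SQL, not just the DataFrame writer; a minimal sketch, assuming the hypothetical partitioned table `records_partitioned` from the DDL sketch earlier:

        // Allow dynamic partition inserts without a static partition spec.
        spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
        spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

        // Partition columns come last in the SELECT list; each row's 'key'
        // value decides which partition it lands in.
        sql("INSERT OVERWRITE TABLE records_partitioned PARTITION (key) " +
          "SELECT value, key FROM records")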


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    @srowen Apologies, I was not aware that PMC members get automatic notifications for this.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454240
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    --- End diff --
    
    @HyukjinKwon Thanks for highlighting this; I have improved it.
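
    On the flagged setting itself: `spark.sql.parquet.writeLegacyFormat` makes Spark write Parquet in the layout Hive and Impala expect, with the difference showing up mainly in how decimals are encoded. A minimal sketch reusing the names from the diff:

        // Write Parquet in the Hive-compatible (legacy) layout before saving
        // to the external table location.
        spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
        hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)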


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158407768
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    +    val hiveTableDF = sql("SELECT * FROM records")
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +    // Create External Hive table with parquet
    +    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
    +      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    --- End diff --
    
    `parquet` -> `Parquet`.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158210132
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    /*
    --- End diff --
    
    +1


---



[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/20018
  
    Thanks @HyukjinKwon


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158454228
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,41 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
     
    +    // Create Hive managed table with parquet
    +    sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
    +    // Save DataFrame to Hive Managed table as Parquet format
    --- End diff --
    
    @HyukjinKwon Thanks for highlighting this; I have improved it.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by chetkhatri <gi...@git.apache.org>.
Github user chetkhatri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r157796580
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -104,6 +103,60 @@ object SparkHiveExample {
         // ...
         // $example off:spark_hive$
    --- End diff --
    
    @srowen Thank you for the valuable review feedback; I have added that so it can help other developers.


---



[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20018


---
