Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2015/04/03 15:03:03 UTC

[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/5348

    [Doc] [SQL] Addes Hive metastore Parquet table conversion section

    This PR adds a section about Hive metastore Parquet table conversion. It documents:
    
    1. Schema reconciliation rules introduced in #5214 (see [this comment][1] in #5188)
    2. Metadata refreshing requirement introduced in #5339
    
    Notice that a Python snippet for refreshing tables is not included, because `refreshTable` is not available in PySpark yet (the Scala counterpart is sketched below for reference). This should be addressed in a separate PR.
    
    [1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248
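
    For reference, the Scala counterpart documented in the new section is roughly the following (a minimal sketch; `sc` is the shell's `SparkContext`, and the table name is just a placeholder):

    ```
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // Invalidate and refresh the cached metadata for a Hive metastore Parquet
    // table after it has been modified outside of Spark SQL.
    sqlContext.refreshTable("my_table")
    ```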

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark sql-doc-parquet-conversion

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5348.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5348
    
----
commit 9840affeb34c2c7ee21cec366af36bc655c4b4fc
Author: Cheng Lian <li...@databricks.com>
Date:   2015-04-03T12:56:53Z

    Addes Hive metastore Parquet table conversion section

----




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89323270
  
    @yhuai This typo has been fixed in master and branch-1.3.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89283177
  
      [Test build #29669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29669/consoleFull) for   PR 5348 at commit [`9840aff`](https://github.com/apache/spark/commit/9840affeb34c2c7ee21cec366af36bc655c4b4fc).




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89305010
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29669/
    Test PASSed.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-96769618
  
      [Test build #31029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31029/consoleFull) for   PR 5348 at commit [`22d7b14`](https://github.com/apache/spark/commit/22d7b14da6e8a95e4d73ce768364f0279e7d5a85).




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5348#discussion_r27737693
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1034,6 +1034,79 @@ df3.printSchema()
     
     </div>
     
    +### Hive metastore Parquet table conversion
    +
    +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
    +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
    +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
    +
    +#### Hive/Parquet Schema Reconciliation
    +
    +There are two key differences between Hive and Parquet from the perspective of table schema
    +processing.
    +
    +1. Hive is case insensitive, while Parquet is not
    +1. Hive considers all columns nullable, while nullability in Parquet is significant
    +
    +Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a
    +Hive metastore Parquet table to a Spark SQL Parquet table.  The reconciliation rules are:
    +
    +1. Fields that have the same name in both schemas must have the same data type regardless of
    +   nullability.  The reconciled field should have the data type of the Parquet side, so that
    +   nullability is respected.
    +
    +1. The reconciled schema contains exactly those fields defined in Hive metastore schema.
    +
    +   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    +   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
    +     reconciled schema.
    +
    +#### Metadata Refreshing
    +
    +Spark SQL caches Parquet metadata for better performance.  When Hive metastore Parquet table
    --- End diff --
    
    Actually, I was wondering whether we have a section that explains what a data source table is?
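
    As an aside, a minimal sketch of toggling the conversion flag quoted above (assuming `sqlContext` is a `HiveContext`, as elsewhere in the guide):

    ```
    // Fall back to the Hive SerDe path for metastore Parquet tables.
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    // Re-enable Spark SQL's native Parquet support (the default).
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    ```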




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5348#discussion_r27737548
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1034,6 +1034,79 @@ df3.printSchema()
     
     </div>
     
    +### Hive metastore Parquet table conversion
    +
    +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
    +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
    +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
    +
    +#### Hive/Parquet Schema Reconciliation
    +
    +There are two key differences between Hive and Parquet from the perspective of table schema
    +processing.
    +
    +1. Hive is case insensitive, while Parquet is not
    +1. Hive considers all columns nullable, while nullability in Parquet is significant
    +
    +Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a
    +Hive metastore Parquet table to a Spark SQL Parquet table.  The reconciliation rules are:
    +
    +1. Fields that have the same name in both schemas must have the same data type regardless of
    +   nullability.  The reconciled field should have the data type of the Parquet side, so that
    +   nullability is respected.
    +
    +1. The reconciled schema contains exactly those fields defined in Hive metastore schema.
    +
    +   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    +   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
    +     reconciled schema.
    +
    +#### Metadata Refreshing
    --- End diff --
    
    It seems this is under `Hive metastore Parquet table conversion`. However, users may need to call `refresh table` in other cases too, right? For example, when they manually copy data into the directory of a data source table.
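
    A minimal sketch of that scenario (the table name is a placeholder, and `sqlContext` is assumed to be a `HiveContext`):

    ```
    // New Parquet files were copied into the table's directory outside of
    // Spark SQL (e.g. with `hadoop fs -put`), so the cached metadata is stale.
    sqlContext.refreshTable("my_table")

    // Subsequent queries pick up the newly added files.
    sqlContext.sql("SELECT COUNT(*) FROM my_table").show()
    ```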




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89354336
  
      [Test build #29674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29674/consoleFull) for   PR 5348 at commit [`22d7b14`](https://github.com/apache/spark/commit/22d7b14da6e8a95e4d73ce768364f0279e7d5a85).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89295605
  
      [Test build #29670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29670/consoleFull) for   PR 5348 at commit [`a56483a`](https://github.com/apache/spark/commit/a56483a58470600fe7beab8c75f5c943302457f8).




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89295091
  
    Added a Python snippet for metadata refreshing after adding `refreshTable` to PySpark in #5349.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89354382
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29674/
    Test PASSed.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89304990
  
      [Test build #29669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29669/consoleFull) for   PR 5348 at commit [`9840aff`](https://github.com/apache/spark/commit/9840affeb34c2c7ee21cec366af36bc655c4b4fc).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89325371
  
      [Test build #29674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29674/consoleFull) for   PR 5348 at commit [`22d7b14`](https://github.com/apache/spark/commit/22d7b14da6e8a95e4d73ce768364f0279e7d5a85).




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89312941
  
    I just found an example with the old code (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema). It still contains the following:
    ```
    // Import Spark SQL data types and Row.
    import org.apache.spark.sql._
    ```
    Can you update it to use `import org.apache.spark.sql.types._` for data types?
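
    For reference, the updated imports would presumably look like this (just a sketch of the suggested change):

    ```
    // Import Row and the Spark SQL data types from their new locations.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    ```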




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89318064
  
      [Test build #29670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29670/consoleFull) for   PR 5348 at commit [`a56483a`](https://github.com/apache/spark/commit/a56483a58470600fe7beab8c75f5c943302457f8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5348#issuecomment-89318108
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29670/
    Test PASSed.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5348#discussion_r27728364
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -21,7 +21,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
     All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell` or the `pyspark` shell.
     
     
    -## Starting Point: `SQLContext`
    +## Starting Point: SQLContext
    --- End diff --
    
    Removed the backquotes because the rendered heading text looks odd.




[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5348#discussion_r27770121
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1034,6 +1034,79 @@ df3.printSchema()
     
     </div>
     
    +### Hive metastore Parquet table conversion
    +
    +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
    +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
    +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
    +
    +#### Hive/Parquet Schema Reconciliation
    +
    +There are two key differences between Hive and Parquet from the perspective of table schema
    +processing.
    +
    +1. Hive is case insensitive, while Parquet is not
    +1. Hive considers all columns nullable, while nullability in Parquet is significant
    +
    +Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a
    +Hive metastore Parquet table to a Spark SQL Parquet table.  The reconciliation rules are:
    +
    +1. Fields that have the same name in both schemas must have the same data type regardless of
    +   nullability.  The reconciled field should have the data type of the Parquet side, so that
    +   nullability is respected.
    +
    +1. The reconciled schema contains exactly those fields defined in Hive metastore schema.
    +
    +   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    +   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
    +     reconciled schema.
    +
    +#### Metadata Refreshing
    +
    +Spark SQL caches Parquet metadata for better performance.  When Hive metastore Parquet table
    --- End diff --
    
    Agreed, the lack of such a section is part of the reason why I put the metadata refreshing section here...
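
    For what it's worth, a hypothetical illustration of the reconciliation rules quoted above (the schemas are made up for illustration; this does not call into Spark's reconciliation code):

    ```
    import org.apache.spark.sql.types._

    // Hive metastore schema: lower-case names, every column nullable.
    val metastoreSchema = StructType(Array(
      StructField("key", IntegerType, nullable = true),
      StructField("value", StringType, nullable = true),
      StructField("comment", StringType, nullable = true)))

    // Parquet file schema: a non-nullable field, plus an extra field that the
    // metastore does not know about.
    val parquetSchema = StructType(Array(
      StructField("key", IntegerType, nullable = false),
      StructField("value", StringType, nullable = true),
      StructField("extra", StringType, nullable = true)))

    // Applying the quoted rules by hand, the reconciled schema keeps exactly the
    // metastore fields: key takes the Parquet side's type and nullability,
    // comment is added as a nullable field, and extra is dropped because it only
    // appears in the Parquet schema.
    ```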


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org