You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by vanzin <gi...@git.apache.org> on 2016/07/19 23:36:25 UTC

[GitHub] spark pull request #14272: [SPARK-16632][sql] Respect Hive schema when mergi...

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/14272

    [SPARK-16632][sql] Respect Hive schema when merging parquet schema.

    When Hive (or at least certain versions of Hive) creates parquet files
    containing tinyint or smallint columns, it stores them as int32, but
    doesn't annotate the parquet field as containing the corresponding
    int8 / int16 data. When Spark reads those files using the vectorized
    reader, it follows the parquet schema for these fields, but when
    actually reading the data it tries to use the type fetched from
    the metastore, and then fails because data has been loaded into the
    wrong fields in OnHeapColumnVector.
    
    So instead of blindly trusting the parquet schema, check whether the
    Catalyst-provided schema disagrees with it, and adjust the types so
    that the necessary metadata is present when loading the data into
    the ColumnVector instance.
    
    Tested with unit tests and with tests that create byte / short columns
    in Hive and try to read them from Spark.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-16632

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14272.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14272
    
----
commit d853ba0769189b0557d801b9e3e1ed1a9d65cea9
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2016-07-19T22:51:16Z

    [SPARK-16632][sql] Respect Hive schema when merging parquet schema.
    
    When Hive (or at least certain versions of Hive) creates parquet files
    containing tinyint or smallint columns, it stores them as int32, but
    doesn't annotate the parquet field as containing the corresponding
    int8 / int16 data. When Spark reads those files using the vectorized
    reader, it follows the parquet schema for these fields, but when
    actually reading the data it tries to use the type fetched from
    the metastore, and then fails because data has been loaded into the
    wrong fields in OnHeapColumnVector.
    
    So instead of blindly trusting the parquet schema, check whether the
    Catalyst-provided schema disagrees with it, and adjust the types so
    that the necessary metadata is present when loading the data into
    the ColumnVector instance.
    
    Tested with unit tests and with tests that create byte / short columns
    in Hive and try to read them from Spark.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    **[Test build #62561 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62561/consoleFull)** for PR 14272 at commit [`d853ba0`](https://github.com/apache/spark/commit/d853ba0769189b0557d801b9e3e1ed1a9d65cea9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    Opened #14278 for the simpler yet more general fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    This LGTM. Although it's a little bit hacky since technically the fields in requested schema passed to the Parquet record reader may have different original types (`INT_8` and `INT_16`) from the actual ones defined in the physical file, fortunately Parquet record reader doesn't check for original types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    **[Test build #62561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62561/consoleFull)** for PR 14272 at commit [`d853ba0`](https://github.com/apache/spark/commit/d853ba0769189b0557d801b9e3e1ed1a9d65cea9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    @yhuai @liancheng 
    
    Not sure if this is the best place for the fix, but the problem is gone with the change. It duplicates some minor logic from `ParquetSchemaConverter`, but it seems weird to call that class from here since this code has no access to config data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    I'm merging this to master.
    
    @yhuai Do we want this in branch-2.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    Discussed with @yhuai, I'm also merging this to branch-2.0.
    
    @vanzin Thanks for fixing this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    yea. I think the fix is pretty safe. After discussion with @liancheng, seems the more general fix is to just to use the requested catalyst schema to initialize the vectorized reader.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    Would like to add that AFAIK byte and short are the only problematic types that we don't handle before this PR. Other Hive-Parquet schema conversion quirks like string (translated into `binary` without `UTF8` annotation) and timestamp (translated into deprecated `int96`) are already worked around in Spark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14272: [SPARK-16632][sql] Respect Hive schema when merging parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14272
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62561/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14272: [SPARK-16632][sql] Respect Hive schema when mergi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14272


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org