You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/07/20 06:52:16 UTC

[GitHub] spark pull request #14278: [SPARK-16632][SQL] Use Spark requested schema to ...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/14278

    [SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization

    ## What changes were proposed in this pull request?
    
    In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).
    
    On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.
    
    ## How was this patch tested?
    
    New test case added in `ParquetQuerySuite`.
    
    [1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-16632-simpler-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14278
    
----
commit 2ade381403080d1390a34b44366ade05f42f6d4f
Author: Cheng Lian <li...@databricks.com>
Date:   2016-07-20T06:31:10Z

    Fixes SPARK-16632

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    @vanzin Could you please help verify this fix? The reason why #14272 works is that the Parquet requested schema is generated using `clipParquetSchema()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    @liancheng I don't think we should use the Spark requested schema for vectorized Parquet reader. It only works for flat schema. We need the converted schema for complex type support, as I do in #14045. That is because we need the repetition and definition level information for each complex type columns. If we directly use Spark requested schema, we can't get the corresponding info.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62585/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Thanks for the review! I'm merging this to master and branch-2.0. Will send PRs to revert #14272 since this one is a more general fix of the same issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62672/consoleFull)** for PR 14278 at commit [`d532e5e`](https://github.com/apache/spark/commit/d532e5ee9ba81259222074eb795f271a5fb11a0c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    @viirya The updated schema field in this PR is only used to guide the vectorized reader to interpret basic Parquet types into logical types (e.g. Parquet `int32` to Spark `ByteType`, and Parquet `int96` to Spark `TimestampType` since Hive doesn't use proper Parquet types). We are still using the properly Parquet schema tailored from physical file schema as Parquet requested schema after this change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62672/consoleFull)** for PR 14278 at commit [`d532e5e`](https://github.com/apache/spark/commit/d532e5ee9ba81259222074eb795f271a5fb11a0c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62585/consoleFull)** for PR 14278 at commit [`3cff4af`](https://github.com/apache/spark/commit/3cff4af454b9d32c9127af85ae0f53d219920370).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Change LGTM, thanks. Tried our tests and they work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    @viirya Basically we are mapping the logic in `ParquetRowConverter` for the non-vectorized Parquet reader. It's just implemented at a lower lever in the case of vectorized reader.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14278: [SPARK-16632][SQL] Use Spark requested schema to ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14278#discussion_r71671818
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ---
    @@ -136,7 +137,9 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
         ReadSupport.ReadContext readContext = readSupport.init(new InitContext(
             taskAttemptContext.getConfiguration(), toSetMultiMap(fileMetadata), fileSchema));
         this.requestedSchema = readContext.getRequestedSchema();
    -    this.sparkSchema = new ParquetSchemaConverter(configuration).convert(requestedSchema);
    +    String sparkRequestedSchemaString =
    +        configuration.get(ParquetReadSupport$.MODULE$.SPARK_ROW_REQUESTED_SCHEMA());
    +    this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
    --- End diff --
    
    Actually it's safer when the Parquet requested schema conforms to the actual physical file to be read. Normally, we shouldn't care about logical types (those with annotations) at the level of Parquet record reader. It's the upper level engine's responsibility to convert basic types like `int32` into logical types like `INT_8` and `INT_16`. The vectorized reader has to mix them up because we need to construct value vectors of proper types at this level.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62585/consoleFull)** for PR 14278 at commit [`3cff4af`](https://github.com/apache/spark/commit/3cff4af454b9d32c9127af85ae0f53d219920370).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Also cc @yhuai.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14278: [SPARK-16632][SQL] Use Spark requested schema to ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14278


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62583/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62583/consoleFull)** for PR 14278 at commit [`2ade381`](https://github.com/apache/spark/commit/2ade381403080d1390a34b44366ade05f42f6d4f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62672/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    @liancheng Yea, you are right! After double-checking, the `sparkSchema` is override later. Thanks for clarifying it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14278: [SPARK-16632][SQL] Use Spark requested schema to guide v...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14278
  
    **[Test build #62583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62583/consoleFull)** for PR 14278 at commit [`2ade381`](https://github.com/apache/spark/commit/2ade381403080d1390a34b44366ade05f42f6d4f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14278: [SPARK-16632][SQL] Use Spark requested schema to ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14278#discussion_r71626483
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ---
    @@ -136,7 +137,9 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
         ReadSupport.ReadContext readContext = readSupport.init(new InitContext(
             taskAttemptContext.getConfiguration(), toSetMultiMap(fileMetadata), fileSchema));
         this.requestedSchema = readContext.getRequestedSchema();
    -    this.sparkSchema = new ParquetSchemaConverter(configuration).convert(requestedSchema);
    +    String sparkRequestedSchemaString =
    +        configuration.get(ParquetReadSupport$.MODULE$.SPARK_ROW_REQUESTED_SCHEMA());
    +    this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
    --- End diff --
    
    We still have the `requestedSchema` in parquet's form, which does not contain the correct annotation. It may still be a potential issue when correct annotations in requestedSchema matters?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org