You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by sr...@apache.org on 2016/08/06 03:40:23 UTC
spark git commit: [SPARK-16847][SQL] Prevent to potentially read corrupt statstics on binary in Parquet vectorized reader

Repository: spark
Updated Branches:
  refs/heads/master e679bc3c1 -> 55d6dad6f


[SPARK-16847][SQL] Prevent to potentially read corrupt statstics on binary in Parquet vectorized reader

## What changes were proposed in this pull request?

This problem was found in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and we disabled filter pushdown on binary columns in Spark before. We enabled this after upgrading Parquet but it seems there is potential incompatibility for Parquet files written in lower Spark versions.

Currently, this does not happen in normal Parquet reader. However, In Spark, we implemented a vectorized reader, separately with Parquet's standard API. For normal Parquet reader this is being handled but not in the vectorized reader.

It is okay to just pass `FileMetaData`. This is being handled in parquet-mr (See https://github.com/apache/parquet-mr/commit/e3b95020f777eb5e0651977f654c1662e3ea1f29). This will prevent loading corrupt statistics in each page in Parquet.

This PR replaces the deprecated usage of constructor.

## How was this patch tested?

N/A

Author: hyukjinkwon <gu...@gmail.com>

Closes #14450 from HyukjinKwon/SPARK-16847.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/55d6dad6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/55d6dad6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/55d6dad6

Branch: refs/heads/master
Commit: 55d6dad6f21dd4d50d168f6392242aa8e24b774a
Parents: e679bc3
Author: hyukjinkwon <gu...@gmail.com>
Authored: Sat Aug 6 04:40:24 2016 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Sat Aug 6 04:40:24 2016 +0100

----------------------------------------------------------------------
 .../datasources/parquet/SpecificParquetRecordReaderBase.java   | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/55d6dad6/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
----------------------------------------------------------------------
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
index 04752ec..dfe6967 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
@@ -140,7 +140,8 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
     String sparkRequestedSchemaString =
         configuration.get(ParquetReadSupport$.MODULE$.SPARK_ROW_REQUESTED_SCHEMA());
     this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
-    this.reader = new ParquetFileReader(configuration, file, blocks, requestedSchema.getColumns());
+    this.reader = new ParquetFileReader(
+        configuration, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
     for (BlockMetaData block : blocks) {
       this.totalRowCount += block.getRowCount();
     }
@@ -204,7 +205,8 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
       }
     }
     this.sparkSchema = new ParquetSchemaConverter(config).convert(requestedSchema);
-    this.reader = new ParquetFileReader(config, file, blocks, requestedSchema.getColumns());
+    this.reader = new ParquetFileReader(
+        config, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
     for (BlockMetaData block : blocks) {
       this.totalRowCount += block.getRowCount();
     }


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org