Posted to issues@spark.apache.org by "Gang Ma (Jira)" <ji...@apache.org> on 2019/09/29 10:04:00 UTC

[jira] [Commented] (SPARK-22536) VectorizedParquetRecordReader doesn't use Parquet's dictionary filtering feature

    [ https://issues.apache.org/jira/browse/SPARK-22536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940330#comment-16940330 ] 

Gang Ma commented on SPARK-22536:
---------------------------------

[~hyukjin.kwon] Why was this one resolved?

> VectorizedParquetRecordReader doesn't use Parquet's dictionary filtering feature
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-22536
>                 URL: https://issues.apache.org/jira/browse/SPARK-22536
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2.0
>            Reporter: Ivan Gozali
>            Priority: Major
>              Labels: bulk-closed, filter2, parquet, predicate, pushdown
>
> The VectorizedParquetRecordReader currently uses only statistics filtering and does not make use of Parquet's dictionary filtering. Dictionary filtering would be very useful for string/binary columns with low cardinality.
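> A concrete (hypothetical) example of a query that would benefit: the dataset path and the {{status}} column below are made up, but a pushed-down equality filter on a low-cardinality string column is exactly the case dictionary filtering targets.
> {code:java}
> import static org.apache.spark.sql.functions.col;
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class LowCardinalityFilterExample {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("dictionary-filtering-example")
>         .getOrCreate();
>
>     // Hypothetical Parquet dataset; "status" holds only a handful of distinct values.
>     Dataset<Row> orders = spark.read().parquet("/data/orders");
>
>     // Min/max statistics on a low-cardinality string column are often too wide to
>     // exclude any row group, so statistics-only filtering reads everything. A
>     // dictionary check could skip every row group whose dictionary page does not
>     // contain "cancelled".
>     long cancelled = orders.filter(col("status").equalTo("cancelled")).count();
>     System.out.println(cancelled);
>   }
> }
> {code}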
> Some relevant code paths:
> * https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L367-L387 When the vectorized reader is enabled, this code path uses VectorizedParquetRecordReader, which extends SpecificParquetRecordReaderBase below
> * https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L109 This is where row group filtering is performed; it calls into the RowGroupFilter linked below (see the sketch after this list)
> * https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L64-L70
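> For illustration, a minimal sketch (not a verbatim copy of {{SpecificParquetRecordReaderBase}}) of why the current call resolves to statistics-only filtering: the three-argument {{filterRowGroups}} overload delegates to the deprecated constructor shown below.
> {code:java}
> import java.util.List;
>
> import org.apache.parquet.filter2.compat.FilterCompat;
> import org.apache.parquet.filter2.compat.RowGroupFilter;
> import org.apache.parquet.hadoop.metadata.BlockMetaData;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> import org.apache.parquet.schema.MessageType;
>
> class StatisticsOnlyPathSketch {
>   // Roughly what the reader does today: no ParquetFileReader is passed, so
>   // dictionary pages are never consulted when deciding to drop a row group.
>   static List<BlockMetaData> filterBlocks(FilterCompat.Filter filter,
>                                           ParquetMetadata footer,
>                                           MessageType fileSchema) {
>     return RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), fileSchema);
>   }
> }
> {code}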
> The {{RowGroupFilter}} constructor reached from Spark's {{VectorizedParquetRecordReader}} is deprecated and hard-codes the {{FilterLevel}} to {{FilterLevel.STATISTICS}} only.
> {code}
>   @Deprecated
>   private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
>     this.blocks = checkNotNull(blocks, "blocks");
>     this.schema = checkNotNull(schema, "schema");
>     this.levels = Collections.singletonList(FilterLevel.STATISTICS);
>     this.reader = null;
>   }
> {code}
> Compare this to {{org.apache.parquet.hadoop.ParquetRecordReader.initialize()}}, which uses the second RowGroupFilter constructor that allows it to set the {{FilterLevel}}. Relevant code here:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182
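> By way of contrast, a minimal sketch of what a dictionary-aware path could look like for the vectorized reader; the helper and its parameters are illustrative, but the overload that accepts a list of {{FilterLevel}}s (and an open {{ParquetFileReader}} so dictionary pages can be read) is the one {{ParquetRecordReader}} goes through.
> {code:java}
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.parquet.filter2.compat.FilterCompat;
> import org.apache.parquet.filter2.compat.RowGroupFilter;
> import org.apache.parquet.filter2.compat.RowGroupFilter.FilterLevel;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.BlockMetaData;
>
> class DictionaryAwareFilteringSketch {
>   // Illustrative helper: request both filter levels instead of the hard-coded
>   // STATISTICS-only path. DICTIONARY lets Parquet drop a row group when the
>   // filtered value is absent from the column's dictionary page.
>   static List<BlockMetaData> filterBlocks(FilterCompat.Filter filter,
>                                           List<BlockMetaData> blocks,
>                                           ParquetFileReader reader) {
>     return RowGroupFilter.filterRowGroups(
>         Arrays.asList(FilterLevel.STATISTICS, FilterLevel.DICTIONARY),
>         filter, blocks, reader);
>   }
> }
> {code}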



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org