Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2016/03/09 19:27:40 UTC

[jira] [Resolved] (PARQUET-374) Add api to read dictionary from each column chunk for predicate pushdown

     [ https://issues.apache.org/jira/browse/PARQUET-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-374.
-------------------------------
    Resolution: Won't Fix

I'm marking this as "Won't fix" because PARQUET-384 includes the proposed API for accessing dictionaries.

> Add api to read dictionary from each column chunk for predicate pushdown
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-374
>                 URL: https://issues.apache.org/jira/browse/PARQUET-374
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> A Parquet file's dictionaries could be used for predicate pushdown.
> For example, the SQL query:
> select * from table where column = 10;
> could skip reading the whole row group if the dictionary for column has values [5, 11, 17, 20].
> This could save IO and improve performance.
> We implemented predicate pushdown using dictionaries in Presto for Parquet files, and benchmarks show up to 40X speedup for selective queries.
> We need to add an API to ParquetFileReader so that it returns dictionaries for requested columns.
> If the column is not dictionary encoded in this row group, return null.
> If not all of the column's pages are dictionary encoded in this row group, return null.
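
The skip decision described in the quoted issue can be sketched roughly as follows. This is only an illustration under stated assumptions: the class DictionaryFilterSketch and the method canSkipRowGroup are hypothetical names, not part of parquet-mr or of the API proposed in PARQUET-384, and the dictionary is modeled as a plain Set<Integer>. A null dictionary stands for the "return null" cases above, where the column chunk is not fully dictionary encoded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these names do not exist in parquet-mr.
class DictionaryFilterSketch {

    /**
     * Decides whether a row group can be skipped for the predicate
     * "column = value".
     *
     * @param dictionary distinct values of the column chunk, or null when the
     *                   chunk is not fully dictionary encoded (assumption:
     *                   mirrors the "return null" rule in the issue)
     * @param value      the literal the query compares against
     */
    public static boolean canSkipRowGroup(Set<Integer> dictionary, int value) {
        if (dictionary == null) {
            // Without a complete dictionary nothing can be ruled out,
            // so the row group has to be read.
            return false;
        }
        // If the value is absent from the dictionary, no row can match.
        return !dictionary.contains(value);
    }

    public static void main(String[] args) {
        // Dictionary values taken from the example in the issue description.
        Set<Integer> dictionary = new HashSet<>(Arrays.asList(5, 11, 17, 20));

        System.out.println(canSkipRowGroup(dictionary, 10)); // 10 is absent
        System.out.println(canSkipRowGroup(dictionary, 11)); // 11 is present
        System.out.println(canSkipRowGroup(null, 10));       // no dictionary
    }
}
```

With the dictionary [5, 11, 17, 20] from the issue, the row group can be skipped for "column = 10" but not for "column = 11". Treating null as "cannot skip" keeps the filter conservative: a chunk that falls back to plain encoding for some pages may contain values that never appear in its dictionary.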


