Posted to issues@drill.apache.org by "Adam Gilmore (JIRA)" <ji...@apache.org> on 2015/04/17 07:22:58 UTC

[jira] [Commented] (DRILL-1950) Implement filter pushdown for Parquet

    [ https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499268#comment-14499268 ] 

Adam Gilmore commented on DRILL-1950:
-------------------------------------

I've been thinking about this one, and I think there could be a good first step towards it.  In the scenario where the flat reader makes vectorized copies out of the parquet file, you probably still want that to occur, even if it reads records that don't match the filter.  In that case, the optimizer rule would not actually remove the filter - it would just push the filter down to Parquet and then run a final, exact filter over the results in the query plan anyway.
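
For what it's worth, here's a rough sketch of what such a rule could look like against Calcite's rule API.  The class and helper names are made up for illustration - this isn't anything in the codebase - but the important bit is that the rule keeps the Filter node in the plan rather than removing it:

{code:java}
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.core.Filter;
import org.apache.calcite.rel.core.TableScan;
import org.apache.calcite.rex.RexNode;

public class ParquetFilterPushDownRule extends RelOptRule {

  public ParquetFilterPushDownRule() {
    // Match a Filter sitting directly on top of a scan.
    super(operand(Filter.class, operand(TableScan.class, none())));
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    Filter filter = call.rel(0);
    TableScan scan = call.rel(1);
    RexNode condition = filter.getCondition();

    // Hand the condition to the scan so the Parquet reader can use it to
    // skip pages whose statistics cannot possibly satisfy it.
    TableScan newScan = pushConditionIntoScan(scan, condition);

    // Key point: keep the Filter on top of the new scan.  The pushed-down
    // condition is only a page-level pre-filter; the reader may still return
    // non-matching records, so the exact filter must still run afterwards.
    call.transformTo(filter.copy(filter.getTraitSet(), newScan, condition));
  }

  // Hypothetical helper - how the condition attaches to the scan is
  // implementation specific, so this sketch just returns the scan unchanged.
  private TableScan pushConditionIntoScan(TableScan scan, RexNode condition) {
    return scan;
  }
}
{code}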

This means we could get some "quick" bang for our buck just by skipping pages whose column min/max statistics explicitly rule out any match with the filter.
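
To make that concrete, the page-skip test is just a conservative comparison against the page's min/max - skip only when the statistics explicitly rule out a match, otherwise read the page.  This is illustrative only (the PageStats type and the predicate shape are invented for the example, not actual reader APIs):

{code:java}
final class PageStats {
  final long min;
  final long max;
  PageStats(long min, long max) { this.min = min; this.max = max; }
}

// Returns true only when no value in the page can possibly match.
static boolean canSkipPage(PageStats stats, String op, long filterValue) {
  switch (op) {
    case "=":  return filterValue < stats.min || filterValue > stats.max;
    case "<":  return stats.min >= filterValue;  // nothing below the bound
    case ">":  return stats.max <= filterValue;  // nothing above the bound
    default:   return false;  // unknown operator: never skip, correctness first
  }
}
{code}

Note the default case: when in doubt we read the page, so pushdown can only ever be a performance win, never a correctness risk.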

Long term, we should remove the filter from the query plan entirely and push it down exclusively to the Parquet reader (when possible), but I think the above could be a really great first step.

What do we think?

> Implement filter pushdown for Parquet
> -------------------------------------
>
>                 Key: DRILL-1950
>                 URL: https://issues.apache.org/jira/browse/DRILL-1950
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Jason Altekruse
>            Assignee: Jacques Nadeau
>             Fix For: Future
>
>
> The parquet reader currently supports project pushdown for limiting the number of columns read; however, it does not yet use filter pushdown to read only a subset of the records. Filter pushdown is particularly useful with parquet files that contain statistics, most importantly min and max values on pages. Evaluating predicates against these values could save major reading and decoding time.
> The largest barrier to implementing this is the current design of the reader. Firstly, we currently have two separate parquet readers, one for reading flat files very quickly and another for reading complex data. There are enhancements we can make to the flat reader to make it support nested data in a much more efficient manner. However, the speed of the flat reader currently comes from being able to make vectorized copies out of the parquet file. This design is somewhat at odds with filter pushdown, as we can only make useful vectorized copies if the filter matches a large run of values within the file. That might not be too rare a case, assuming files are often somewhat sorted on a primary field like a date or a numeric key, and these are often the fields used to limit a query to a subset of the data. However, for cases where we are filtering out a few records here and there, we should just make individual copies.
> We need to do more design work on the best way to balance performance with these use cases in mind.
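
To make the run-versus-individual-copy tension in the description above concrete, here's an illustrative (non-Drill) sketch: given a selection mask over a decoded value array, contiguous runs of selected values amortize to a single block copy, while scattered matches degrade to runs of length one:

{code:java}
// Illustrative only - not Drill's reader code.
static int copySelected(long[] src, boolean[] selected, long[] dst) {
  int out = 0;
  int i = 0;
  while (i < src.length) {
    if (!selected[i]) { i++; continue; }
    int runStart = i;
    while (i < src.length && selected[i]) { i++; }
    int runLength = i - runStart;
    // A long run amortizes to one vectorized block copy; a run of one is
    // the "individual copies" case described above.
    System.arraycopy(src, runStart, dst, out, runLength);
    out += runLength;
  }
  return out;  // number of values copied into dst
}
{code}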



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)