Posted to dev@parquet.apache.org by "Yash Datta (JIRA)" <ji...@apache.org> on 2014/11/08 16:13:33 UTC

[jira] [Commented] (PARQUET-128) Optimize the parquet RecordReader implementation when: A. filter predicate is pushed down, B. filter predicate is pushed down on a flat schema

    [ https://issues.apache.org/jira/browse/PARQUET-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203458#comment-14203458 ] 

Yash Datta commented on PARQUET-128:
------------------------------------

For a 47-million-row Parquet table in Spark SQL with a flat schema (no nested/repeating columns), applying a simple filter:

select * from d_tup_parq_yash where id = 10;   (which returns a single record) 
or
select * from d_tup_parq_yash where id < 500; (which returns 500 records)

Time taken before the patch: 4.4 seconds
Time taken after the patch: 2.6 seconds

> Optimize the parquet RecordReader implementation when: A. filter predicate is pushed down, B. filter predicate is pushed down on a flat schema
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-128
>                 URL: https://issues.apache.org/jira/browse/PARQUET-128
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0rc2
>            Reporter: Yash Datta
>             Fix For: parquet-mr_1.6.0
>
>
> The RecordReader implementation currently reads all the columns of a row before applying the filter predicate and deciding whether to keep or discard the row.
> We can instead have a RecordReader that first assembles only the columns on which filters are applied (usually just a few), applies the filter to decide whether to keep the row, and then either assembles the remaining columns or skips them accordingly.
> Also, for applications like Spark SQL, the schema is usually flat, with no repeating or nested columns. In such cases it is better to have a lightweight, faster RecordReader.
> The performance improvement from this change is significant, and is largest when filtering returns a small number of rows (which is usually the case) and the table has many columns.
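A rough Java sketch of the two-phase assembly described above (the type names ColumnCursor, RowPredicate and FilteringRecordReader are hypothetical stand-ins used for illustration, not actual parquet-mr classes):

    // Illustrative sketch only; these types are hypothetical stand-ins,
    // not parquet-mr classes.
    interface ColumnCursor {
        Object readValue();  // decode the current value and advance
        void skipValue();    // advance without decoding
    }

    interface RowPredicate {
        boolean keep(Object[] filterValues);
    }

    final class FilteringRecordReader {
        private final ColumnCursor[] filterColumns; // columns the predicate reads
        private final ColumnCursor[] otherColumns;  // all remaining columns
        private final RowPredicate predicate;

        FilteringRecordReader(ColumnCursor[] filterColumns,
                              ColumnCursor[] otherColumns,
                              RowPredicate predicate) {
            this.filterColumns = filterColumns;
            this.otherColumns = otherColumns;
            this.predicate = predicate;
        }

        // Scans up to 'remainingRows' rows and returns the first row that
        // passes the predicate, or null if none does.
        Object[] readNextMatchingRow(long remainingRows) {
            for (long r = 0; r < remainingRows; r++) {
                // Phase 1: materialize only the (usually few) filter columns.
                Object[] filterValues = new Object[filterColumns.length];
                for (int i = 0; i < filterColumns.length; i++) {
                    filterValues[i] = filterColumns[i].readValue();
                }
                if (predicate.keep(filterValues)) {
                    // Row survives: assemble the remaining columns too.
                    Object[] row = new Object[filterColumns.length + otherColumns.length];
                    System.arraycopy(filterValues, 0, row, 0, filterValues.length);
                    for (int i = 0; i < otherColumns.length; i++) {
                        row[filterValues.length + i] = otherColumns[i].readValue();
                    }
                    return row;
                }
                // Row is filtered out: skip the remaining columns without
                // decoding them, which is where the time is saved.
                for (ColumnCursor c : otherColumns) {
                    c.skipValue();
                }
            }
            return null;
        }
    }

The saving comes from the skip path: for rows that fail the predicate (most rows when the filter is selective), the non-filter columns are skipped rather than decoded, so the cost per discarded row is proportional to the number of filter columns rather than the total column count.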



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)