Posted to issues@hawq.apache.org by "Goden Yao (JIRA)" <ji...@apache.org> on 2016/07/05 19:01:11 UTC
[jira] [Comment Edited] (HAWQ-886) Support PXF filter push down for ORC

    [ https://issues.apache.org/jira/browse/HAWQ-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362999#comment-15362999 ] 

Goden Yao edited comment on HAWQ-886 at 7/5/16 7:01 PM:
--------------------------------------------------------

Results of an initial performance evaluation of reading ORC files with individual features turned on and off.
Data: ~500,000 rows, 6 columns of primitive data types

Read using native row-based reader ~ 1500ms
Read using vectorized batch reader (default batch size 1024) ~ 1000ms
Read with filter (7500 rows) ~ 750ms
Read without filter, with column projection ~ 850ms
Read with filter and column projection ~ 600ms

Overall, read time drops from ~1500ms to ~600ms, a reduction of roughly 60% (about 2.5x faster), on a rather small dataset.
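
For reference, the relative gains can be recomputed from the timings listed in this comment. The snippet below is only a worked arithmetic sketch: the dictionary labels and helper names are made up for illustration, and it assumes the row-based reader (~1500ms) is the baseline for every comparison.

```python
# Timings in milliseconds, copied from the measurements in this comment.
# The labels are illustrative only.
timings_ms = {
    "row-based reader": 1500,
    "vectorized batch reader": 1000,
    "with filter": 750,
    "with column projection": 850,
    "with filter + column projection": 600,
}

# Assumption: the row-based reader is the baseline for all comparisons.
baseline_ms = timings_ms["row-based reader"]


def reduction_pct(ms, base=baseline_ms):
    """Percentage of read time saved relative to the baseline."""
    return 100.0 * (base - ms) / base


def speedup_factor(ms, base=baseline_ms):
    """How many times faster a run is than the baseline."""
    return base / ms


for name, ms in timings_ms.items():
    print(f"{name:32s} {ms:5d} ms  "
          f"{reduction_pct(ms):5.1f}% less time  {speedup_factor(ms):.2f}x")
```

With these numbers, the filter plus column projection run takes 60% less time than the baseline, which corresponds to a 2.5x speedup.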


was (Author: shivram):
Based on an initial performance evaluation on Reading ORC files with features turned on/off
Data ~ 500000 rows 6 columns with primitive data types

Read using naive row based reader ~ 1500ms
Read using Vectorizedbatch reader (default 1024 batch size) ~ 1000ms
Read with filter (7500 rows) ~ 750ms
Read without filter with column projection ~ 850ms
Read with filter with column projection ~ 600ms

Over all we can achieve roughly a 60% speedup over a rather small dataset.

> Support PXF filter push down for ORC
> ------------------------------------
>
>                 Key: HAWQ-886
>                 URL: https://issues.apache.org/jira/browse/HAWQ-886
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: PXF
>            Reporter: Shivram Mani
>            Assignee: Shivram Mani
>             Fix For: 2.1.0
>
>
> Currently, when HAWQ reads ORC files via PXF (using the default Hive profile), it doesn’t push any filter information down to the underlying ORC reader. The only filtering possible right now is at the partition level, and it is done generically for all Hive tables.
> ORC files internally contain file-level, stripe-level and row-level statistics, including information such as min/max values. For more information, refer to https://orc.apache.org/docs/indexes.html
> The proposal here is to introduce a new PXF profile optimized for ORC files that leverages these statistics to improve the performance of HAWQ queries with predicates. We will also use the vectorized approach while reading, as opposed to the existing reader, which is row-based and more expensive.
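
To illustrate why min/max statistics help queries with predicates, here is a toy model of statistics-based pruning. It is a sketch only: `StripeStats`, `may_contain` and `stripes_to_read` are hypothetical names invented for this example, not part of the ORC or PXF APIs, and the stripe values are made up. The idea is that a stripe whose value range cannot satisfy the predicate never needs to be read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StripeStats:
    """Hypothetical min/max summary of one column in one stripe."""
    min_value: int
    max_value: int


def may_contain(stats: StripeStats, op: str, value: int) -> bool:
    """Return False only when the stripe provably has no matching row."""
    if op == "=":
        return stats.min_value <= value <= stats.max_value
    if op == "<":
        return stats.min_value < value
    if op == ">":
        return stats.max_value > value
    raise ValueError(f"unsupported operator: {op}")


def stripes_to_read(stripes: List[StripeStats], op: str, value: int) -> List[int]:
    """Indexes of the stripes that still have to be scanned."""
    return [i for i, s in enumerate(stripes) if may_contain(s, op, value)]


# Made-up example: three stripes with non-overlapping value ranges.
stripes = [StripeStats(0, 99), StripeStats(100, 199), StripeStats(200, 299)]

print(stripes_to_read(stripes, "<", 150))  # stripe 2 is skipped
print(stripes_to_read(stripes, "=", 250))  # only stripe 2 is scanned
```

Rows inside the selected stripes still have to be checked against the predicate, since the statistics only rule stripes out; they do not prove that every row matches.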


