Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2016/01/19 00:56:40 UTC
[jira] [Commented] (DRILL-2517) Apply Partition pruning before reading files during planning

    [ https://issues.apache.org/jira/browse/DRILL-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105993#comment-15105993 ] 

Jinfeng Ni commented on DRILL-2517:
-----------------------------------

Pull request: https://github.com/apache/drill/pull/328/files 

The PR contains the changes from both Adam and Mehant. I added some code changes on top of theirs.

I did a preliminary performance comparison on my Mac laptop. The data set has about 115k Parquet files in total, organized in 25 directories (1990, 1991, ...), and each directory has four subdirectories (Q1, Q2, Q3, Q4).

For the following query:
{code}
explain plan for select * from t1 where dir0= 1990 and dir1 = 'Q1';
{code}

The master branch takes 19.4 seconds; the DRILL-2517 patch takes 8.8 seconds. Both were measured on the second run, with a warm cache.
{code}
1 row selected (19.434 seconds)

1 row selected (8.845 seconds)
{code} 

The log shows that the time for reading Parquet metadata from the file footers is significantly reduced (from 7388ms to 102ms), due to the pruning effect.

On master branch: 
{code}
Fetch parquet metadata: Executed 115544 out of 115544 using 16 threads. Time: 7388ms total, 1.019393ms avg, 745ms max.
{code}

With patch:
{code}
Fetch parquet metadata: Executed 1111 out of 1111 using 16 threads. Time: 102ms total, 1.053320ms avg, 8ms max.
{code}
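
As a rough illustration of why the file count drops (this is not the code in the PR, only a minimal sketch), the idea is to filter the file list by its directory levels first and read footers only for the files that survive. The helper names `partition_dirs`, `prune` and `read_footer`, the `/data/t1` paths, and the string comparison of directory names are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def partition_dirs(path, root):
    """Return the directory levels between root and the file: (dir0, dir1, ...)."""
    rel = PurePosixPath(path).relative_to(root)
    return rel.parts[:-1]  # drop the file name itself


def prune(paths, root, wanted):
    """Keep only the paths whose directory levels match `wanted`.

    `wanted` maps a directory level to the required name,
    e.g. {0: "1990", 1: "Q1"} for dir0 = 1990 and dir1 = 'Q1'.
    """
    kept = []
    for p in paths:
        dirs = partition_dirs(p, root)
        if all(lvl < len(dirs) and dirs[lvl] == val for lvl, val in wanted.items()):
            kept.append(p)
    return kept


def read_footer(path):
    # Hypothetical stand-in for the expensive footer read.
    return {"path": path}


# Hypothetical file listing for a table rooted at /data/t1.
paths = [
    "/data/t1/1990/Q1/a.parquet",
    "/data/t1/1990/Q2/b.parquet",
    "/data/t1/1991/Q1/c.parquet",
]

# Prune on directory names first, then read footers only for the survivors.
selected = prune(paths, "/data/t1", {0: "1990", 1: "Q1"})
footers = [read_footer(p) for p in selected]
```

In this sketch only one of the three files reaches `read_footer`, which mirrors the drop from 115544 to 1111 footer reads in the logs above.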


> Apply Partition pruning before reading files during planning
> ------------------------------------------------------------
>
>                 Key: DRILL-2517
>                 URL: https://issues.apache.org/jira/browse/DRILL-2517
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>             Fix For: Future
>
>
> Partition pruning still tries to read Parquet files during the planning stage even though they don't match the partition filter.
> For example, if there were an invalid Parquet file in a directory that should not be queried:
> {code}
> 0: jdbc:drill:zk=local> select sum(price) from dfs.tmp.purchases where dir0 = 1;
> Query failed: IllegalArgumentException: file:/tmp/purchases/4/0_0_0.parquet is not a Parquet file (too small)
> {code}
> The reason is that the partition pruning happens after the Parquet plugin tries to read the footer of each file.
> Ideally, partition pruning would happen first before the format plugin gets involved.


