Posted to issues@drill.apache.org by "Dechang Gu (JIRA)" <ji...@apache.org> on 2017/02/07 21:22:41 UTC

[jira] [Assigned] (DRILL-4827) Checking modification time of directories takes too long, needs to be improved

     [ https://issues.apache.org/jira/browse/DRILL-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dechang Gu reassigned DRILL-4827:
---------------------------------

    Assignee: Aman Sinha

> Checking modification time of directories takes too long, needs to be improved
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-4827
>                 URL: https://issues.apache.org/jira/browse/DRILL-4827
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.8.0
>         Environment: RHEL 6
>            Reporter: Dechang Gu
>            Assignee: Aman Sinha
>
> This is a tracking bug for metadata cache performance when checking directories.
> While evaluating the fix for DRILL-4530, we ran the following two queries on 50K parquet files in a 3-level directory hierarchy:
> Query1: explain plan for select * from dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where dir0=2006 and dir1=12 and dir2=15;
> Query2: explain plan for select * from dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem/2006/12/15`;
> Query1 takes 3.254 secs; Query2 takes 0.505 secs.
> Drillbit.log shows that for Query1, about 2.5 secs were spent after the metadata cache was read and before partition pruning began:
> 2016-08-02 15:43:43,051 ucs-node7.perf.lab [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id 285edddf-b1f3-cd74-e826-84cb91ebc6e1: explain plan for select * from dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where dir0=2006 and dir1=12 and dir2=15
> 2016-08-02 15:43:43,193 ucs-node7.perf.lab [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  o.a.d.exec.store.parquet.Metadata - Took 6 ms to read directories from directory cache file
> 2016-08-02 15:43:45,745 ucs-node7.perf.lab [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning class: org.apache.drill.exec.planner.logical.partition.PruneScanRule$DirPruneScanFilterOnScanRule
> Further investigation shows that the 2.5 secs were spent checking the modification times of directories, a cost that is proportional to the number of directories to be checked.
> It looks like this can be improved by checking only the top-level directory.
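
To illustrate the cost difference described above, here is a minimal sketch comparing the two strategies. It is not Drill's code: Drill reads modification times through the Hadoop FileSystem API, while this sketch uses java.nio.file on a local temp directory as a stand-in. The class name DirStalenessSketch, its methods, and the statCalls counter are all hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not taken from the Drill code base.
public class DirStalenessSketch {

    // Counts modification-time lookups so the two strategies can be compared.
    public static int statCalls = 0;

    // True if the directory was modified after the cache file was written.
    public static boolean isNewer(Path dir, FileTime cacheTime) throws IOException {
        statCalls++;
        return Files.getLastModifiedTime(dir).compareTo(cacheTime) > 0;
    }

    // Checks every directory under root: one lookup per directory,
    // so the cost grows with the number of directories.
    public static boolean staleCheckAll(Path root, FileTime cacheTime) throws IOException {
        List<Path> dirs;
        try (Stream<Path> paths = Files.walk(root)) {
            dirs = paths.filter(Files::isDirectory).collect(Collectors.toList());
        }
        for (Path dir : dirs) {
            if (isNewer(dir, cacheTime)) {
                return true;
            }
        }
        return false;
    }

    // Checks only the top-level directory: a single lookup.
    public static boolean staleCheckTopOnly(Path root, FileTime cacheTime) throws IOException {
        return isNewer(root, cacheTime);
    }

    public static void main(String[] args) throws IOException {
        // Small 3-level tree standing in for dir0/dir1/dir2.
        Path root = Files.createTempDirectory("lineitem");
        Files.createDirectories(root.resolve("2006/12/15"));
        Files.createDirectories(root.resolve("2006/12/16"));

        // Pretend the cache file is newer than every directory,
        // so neither strategy can stop early.
        FileTime cacheTime = FileTime.fromMillis(System.currentTimeMillis() + 60_000L);

        statCalls = 0;
        staleCheckAll(root, cacheTime);
        System.out.println("check all directories: " + statCalls + " lookups");

        statCalls = 0;
        staleCheckTopOnly(root, cacheTime);
        System.out.println("check top level only:  " + statCalls + " lookups");
    }
}
```

One caveat with the top-level-only variant: on typical file systems a directory's modification time changes only when its own entries are added, removed, or renamed, so a change deep in the hierarchy does not update the top-level timestamp by itself. Whatever writes into the hierarchy would need to make such changes visible at the top level for this check to be reliable.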



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)