You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Andries Engelbrecht (JIRA)" <ji...@apache.org> on 2015/04/09 02:23:12 UTC
[jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

    [ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486406#comment-14486406 ] 

Andries Engelbrecht commented on DRILL-2157:
--------------------------------------------

Can be marked as resolved.

Tested with following use cases and shows the correct numFiles being scanned.
Casting or string no longer needed for lower level directories.

Seems to be resolved in Drill 0.8 for direct DFS access

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2=20 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 and 25 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir0=2015 and dir1=01 and dir2 between 20 and 25 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir2 between 20 and 25 and dir3 between 00 and 05;

select count(id) from `/nfl` where dir2>25 and dir3 between 00 and 05;

For Views also seems to report the correct number of files to be scanned with directory pruning

select count(id) from dfs.views.tweet_base where dir_year=2015 and dir_month=01 and dir_day=26 and dir_hour>20;

select count(id) from dfs.views.tweet_base where dir_day between 20 and 26 and dir_hour>20;

> Directory pruning on subdirectories only and data type conversions for directory filters
> ----------------------------------------------------------------------------------------
>
>                 Key: DRILL-2157
>                 URL: https://issues.apache.org/jira/browse/DRILL-2157
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: Future
>            Reporter: Andries Engelbrecht
>            Assignee: Aman Sinha
>            Priority: Minor
>             Fix For: 0.8.0
>
>
> Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter.
> My current observation is that for a directory structure as listed below,
> the pruning only works if the full tree is provided. If only a lower level
> directory is supplied in the filter condition Drill only uses it as a
> filter.
> With directory structure as below
> /2015
>         /01
>                /10
>                /11
>                /12
>                /13
>                /14
> Query:
> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> dir2='10'
> Produces the correct pruning and query plan
> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> 28434
> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> 01-04                Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> However:
> select count(id) from `/foo` t where dir2='10'
> Produces full scan of all sub directories and only applies a filter
> condition after the fact. Notice the numFiles between the 2, even though it
> lists columns in the base scan
> 01-04                Filter(condition=[=($0, '10')]): rowcount =
> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
> 0.0 network, 0.0 memory}, id = 27470
> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
> io, 0.0 network, 0.0 memory}, id = 27469
> 01-06                    Scan(groupscan=[EasyGroupScan
> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> Also using the wrong data type for the filter produces a full scan
> select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14
> Produces
> 01-04                Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910
> 01-05                  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
> 01-06                    Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)