You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Andries Engelbrecht (JIRA)" <ji...@apache.org> on 2015/02/04 03:10:35 UTC
[jira] [Created] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters

Andries Engelbrecht created DRILL-2157:
------------------------------------------

             Summary: Directory pruning on subdirectories only and data type conversions for directory filters
                 Key: DRILL-2157
                 URL: https://issues.apache.org/jira/browse/DRILL-2157
             Project: Apache Drill
          Issue Type: Improvement
          Components: Query Planning & Optimization
    Affects Versions: Future
            Reporter: Andries Engelbrecht
            Assignee: Jinfeng Ni
            Priority: Minor


Drill will scan all files and directories when using only a subdirectory as a predicate. Additionally if the data type for the directory filter is not a string and is converted Drill will also first scan all the subdirectories adn files before applying the filter.

My current observation is that for a directory structure as listed below,
the pruning only works if the full tree is provided. If only a lower level
directory is supplied in the filter condition Drill only uses it as a
filter.

With directory structure as below
/2015
        /01
               /10
               /11
               /12
               /13
               /14

Query:
select count(id) from `/foo` t where dir0='2015' and dir1='01' and
dir2='10'

Produces the correct pruning and query plan

01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
{1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
28434
01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
0.0 io, 0.0 network, 0.0 memory}, id = 28433
01-04                Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]


However:

select count(id) from `/foo` t where dir2='10'

Produces full scan of all sub directories and only applies a filter
condition after the fact. Notice the numFiles between the 2, even though it
lists columns in the base scan

01-04                Filter(condition=[=($0, '10')]): rowcount =
9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
0.0 network, 0.0 memory}, id = 27470
01-05                  Project(dir2=[$1], id=[$0]): rowcount =
6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
io, 0.0 network, 0.0 memory}, id = 27469
01-06                    Scan(groupscan=[EasyGroupScan
[selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]

Also using the wrong data type for the filter produces a full scan

select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14

Produces

01-04                Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910
01-05                  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909
01-06                    Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`],





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)