Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/15 01:02:59 UTC
[jira] [Created] (DRILL-2794) Partition pruning is not happening correctly when maxdir/mindir is used in the filter condition
Victoria Markman created DRILL-2794:
---------------------------------------
Summary: Partition pruning is not happening correctly when maxdir/mindir is used in the filter condition
Key: DRILL-2794
URL: https://issues.apache.org/jira/browse/DRILL-2794
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 0.9.0
Reporter: Victoria Markman
Assignee: Jinfeng Ni
Directory structure:
{code}
[Tue Apr 14 13:43:54 root@/mapr/vmarkman.cluster.com/test/smalltable ] # ls -R
.:
2014 2015 2016
./2014:
./2015:
01 02
./2015/01:
t1.csv
./2015/02:
t2.csv
./2016:
t1.csv
[Tue Apr 14 13:44:26 root@/mapr/vmarkman.cluster.com/test/bigtable ] # ls -R
.:
2015 2016
./2015:
01 02 03 04
./2015/01:
0_0_0.parquet 1_0_0.parquet 2_0_0.parquet 3_0_0.parquet 4_0_0.parquet 5_0_0.parquet
./2015/02:
0_0_0.parquet
./2015/03:
0_0_0.parquet
./2015/04:
0_0_0.parquet
./2016:
01 parquet.file
./2016/01:
0_0_0.parquet
{code}
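For context, maxdir() and mindir() return the lexicographically largest and smallest subdirectory names under a workspace path, so maxdir('dfs.test', 'bigtable') folds to '2016' and mindir('dfs.test', 'bigtable/2016') folds to '01'. A minimal Python model of that behavior against the layout above (a simplified sketch that considers only subdirectory names, not Drill's implementation):

```python
import os
import tempfile

def maxdir(root):
    # Model of Drill's maxdir(): lexicographically largest subdirectory name.
    return max(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))

def mindir(root):
    # Model of Drill's mindir(): lexicographically smallest subdirectory name.
    return min(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))

# Recreate the 'bigtable' directory layout from the listing above.
base = tempfile.mkdtemp()
for d in ("2015/01", "2015/02", "2015/03", "2015/04", "2016/01"):
    os.makedirs(os.path.join(base, "bigtable", d))

print(maxdir(os.path.join(base, "bigtable")))           # 2016
print(mindir(os.path.join(base, "bigtable", "2016")))   # 01
```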
Simple case where partition pruning happens correctly: only the 2016 directory is scanned from 'smalltable'.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable');
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(*=[$0])
00-02 Project(*=[$0])
00-03 Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=1, columns=[`*`], files=[maprfs:/test/smalltable/2016/t1.csv]]])
| {
"head" : {
"version" : 1,
"generator" : {
"type" : "ExplainHandler",
"info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
},
"graph" : [ {
"pop" : "fs-scan",
"@id" : 3,
"files" : [ "maprfs:/test/smalltable/2016/t1.csv" ],
"storage" : {
"type" : "file",
"enabled" : true,
"connection" : "maprfs:///",
"workspaces" : {
"root" : {
"location" : "/",
"writable" : false,
"defaultInputFormat" : null
},
...
...
{code}
With a second predicate added, dir1 = mindir('dfs.test', 'bigtable/2016'), which evaluates to false (there is no directory '01' in smalltable),
we end up scanning everything in smalltable. This does not look right to me; I think this is a bug.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(*=[$0])
00-02 Project(T15¦¦*=[$0])
00-03 SelectionVectorRemover
00-04 Filter(condition=[AND(=($1, '2016'), =($2, '01'))])
00-05 Project(T15¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=3, columns=[`*`], files=[maprfs:/test/smalltable/2015/01/t1.csv, maprfs:/test/smalltable/2015/02/t2.csv, maprfs:/test/smalltable/2016/t1.csv]]])
| {
"head" : {
"version" : 1,
"generator" : {
"type" : "ExplainHandler",
"info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
},
"graph" : [ {
"pop" : "fs-scan",
"@id" : 6,
"files" : [ "maprfs:/test/smalltable/2015/01/t1.csv", "maprfs:/test/smalltable/2015/02/t2.csv", "maprfs:/test/smalltable/2016/t1.csv" ],
"storage" : {
"type" : "file",
"enabled" : true,
"connection" : "maprfs:///",
"workspaces" : {
"root" : {
"location" : "/",
"writable" : false,
"defaultInputFormat" : null
},
...
...
{code}
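The expected behavior can be modeled as constant-folding maxdir/mindir into literals and then pruning the file list on the directory columns. A hypothetical sketch (prune_files is not a Drill API; the file list is copied from the plans above):

```python
def prune_files(files, selection_root, dir_filters):
    """Keep only files whose dirN path components match the folded filter values.

    dir_filters maps the directory-column index (0 for dir0, 1 for dir1, ...)
    to the constant the filter requires.
    """
    kept = []
    for f in files:
        # Path components below the selection root, excluding the file name.
        parts = f[len(selection_root):].strip("/").split("/")[:-1]
        if all(i < len(parts) and parts[i] == v for i, v in dir_filters.items()):
            kept.append(f)
    return kept

smalltable = [
    "/test/smalltable/2015/01/t1.csv",
    "/test/smalltable/2015/02/t2.csv",
    "/test/smalltable/2016/t1.csv",
]

# dir0 = maxdir(...) folds to '2016': one file survives, as in the first plan.
print(prune_files(smalltable, "/test/smalltable", {0: "2016"}))

# Adding dir1 = mindir(...) -> '01' matches nothing in smalltable, so the
# scan should shrink to an empty (or single schema-only) file list, not
# grow back to all three files as the plan above shows.
print(prune_files(smalltable, "/test/smalltable", {0: "2016", 1: "01"}))  # []
```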
Here is a similar example with a parquet file, where the predicate "a1 = 11" evaluates to false.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0=maxdir('dfs.test','bigtable') and a1 = 11;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(*=[$0])
00-02 Project(T25¦¦*=[$0])
00-03 SelectionVectorRemover
00-04 Filter(condition=[AND(=($1, '2016'), =($2, 11))])
00-05 Project(T25¦¦*=[$0], dir0=[$1], a1=[$2])
00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2016/01/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/test/bigtable/2016/parquet.file]], selectionRoot=/test/bigtable, numFiles=2, columns=[`*`]]])
| {
"head" : {
"version" : 1,
"generator" : {
"type" : "ExplainHandler",
"info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
},
"graph" : [ {
"pop" : "parquet-scan",
"@id" : 6,
"entries" : [ {
"path" : "maprfs:/test/bigtable/2016/01/0_0_0.parquet"
}, {
"path" : "maprfs:/test/bigtable/2016/parquet.file"
} ],
{code}
And finally, when we use the same table in the FROM clause and in maxdir/mindir, we scan only one file (to return the schema).
I would think the same should happen in the bug case above.
{code}
0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(*=[$0])
00-02 Project(T29¦¦*=[$0])
00-03 SelectionVectorRemover
00-04 Filter(condition=[AND(=($1, '2016'), =($2, 'parquet.file'))])
00-05 Project(T29¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2015/01/4_0_0.parquet]], selectionRoot=/test/bigtable, numFiles=1, columns=[`*`]]])
| {
"head" : {
"version" : 1,
"generator" : {
"type" : "ExplainHandler",
"info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
},
"graph" : [ {
"pop" : "parquet-scan",
"@id" : 6,
"entries" : [ {
"path" : "maprfs:/test/bigtable/2015/01/4_0_0.parquet"
} ],
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)