Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2016/02/15 18:17:18 UTC

[jira] [Commented] (DRILL-4201) DrillPushFilterPastProject should allow partial filter pushdown.

    [ https://issues.apache.org/jira/browse/DRILL-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147610#comment-15147610 ] 

Victoria Markman commented on DRILL-4201:
-----------------------------------------

I verified that the partial filter is getting pushed down; however, it does not happen in every case. It depends on the costing, and the heuristic there is a bit tricky.
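As a rough illustration (this is a sketch, not Drill's actual planner code), partial pushdown amounts to splitting a conjunctive filter into the conjuncts that may move below the project and the conjuncts that must stay above it because they reference a column produced by FLATTEN:

```python
# Illustrative sketch only -- not DrillPushFilterPastProjectRule itself.
# Partitions filter conjuncts by whether they reference a column produced
# by FLATTEN (or a similar expression the filter cannot be pushed past).

def split_conjunction(conjuncts, non_pushable_cols):
    """Partition (condition, referenced_cols) pairs into (pushable, retained)."""
    pushable, retained = [], []
    for cond, referenced_cols in conjuncts:
        if referenced_cols & non_pushable_cols:
            retained.append(cond)   # depends on FLATTEN output: keep above the project
        else:
            pushable.append(cond)   # safe to evaluate below the project
    return pushable, retained

# Example mirroring the issue description:
# WHERE partition_col > 10 AND flatten_output_col = 'ABC'
conjuncts = [
    ("partition_col > 10", {"partition_col"}),
    ("flatten_output_col = 'ABC'", {"flatten_output_col"}),
]
pushable, retained = split_conjunction(conjuncts, {"flatten_output_col"})
print(pushable)   # ["partition_col > 10"]
print(retained)   # ["flatten_output_col = 'ABC'"]
```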

In the case below, the filter is not pushed past the project, because the file vicky.json contains only 2 rows:
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . >         *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         hive.lineitem_text_hive l
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         ( select
. . . . . . . . . . . . >                 flatten(test)   as test,
. . . . . . . . . . . . >                 o_orderkey      as orderkey
. . . . . . . . . . . . >         from
. . . . . . . . . . . . >                 dfs.`/drill/testdata/Tpch0.01/json/orders/vicky.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey )
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-02        Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-03          HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05            Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04            SelectionVectorRemover
00-06              Filter(condition=[AND(=($0, 1), =($1, 22))])
00-07                Flatten(flattenField=[$0])
00-08                  Project(test=[$1], orderkey=[$0])
00-09                    Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/vicky.json, numFiles=1, columns=[`test`, `o_orderkey`], files=[maprfs:///drill/testdata/Tpch0.01/json/orders/vicky.json]]])
{code}

It is not pushed past the project even if I add 40 columns to the projection (JSON file with 2 rows):
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . >         *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         hive.lineitem_text_hive l
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         ( select
. . . . . . . . . . . . >                 flatten(test)           as test,
. . . . . . . . . . . . >                 o_orderkey              as orderkey,
. . . . . . . . . . . . >                 o_orderkey + 1          as o1,
. . . . . . . . . . . . >                 o_orderkey + 2          as o2,
. . . . . . . . . . . . >                 o_orderkey + 3          as o3,
. . . . . . . . . . . . >                 o_orderkey + 4          as o4,
. . . . . . . . . . . . >                 o_orderkey + 5          as o5,
. . . . . . . . . . . . >                 o_orderkey + 6          as o6,
. . . . . . . . . . . . >                 o_orderkey + 7          as o7,
. . . . . . . . . . . . >                 o_orderkey + 8          as o8,
. . . . . . . . . . . . >                 o_orderkey + 9          as o9,
. . . . . . . . . . . . >                 o_orderkey + 10         as o10,
. . . . . . . . . . . . >                 o_orderkey + 11         as o11,
. . . . . . . . . . . . >                 o_orderkey + 12         as o12,
. . . . . . . . . . . . >                 o_orderkey + 13         as o13,
. . . . . . . . . . . . >                 o_orderkey + 14         as o14,
. . . . . . . . . . . . >                 o_orderkey + 15         as o15,
. . . . . . . . . . . . >                 o_orderkey + 16         as o16,
. . . . . . . . . . . . >                 o_orderkey + 17         as o17,
. . . . . . . . . . . . >                 o_orderkey + 18         as o18,
. . . . . . . . . . . . >                 o_orderkey + 19         as o19,
. . . . . . . . . . . . >                 o_orderkey + 20         as o20,
. . . . . . . . . . . . >                 o_orderkey + 21         as o21,
. . . . . . . . . . . . >                 o_orderkey + 22         as o22,
. . . . . . . . . . . . >                 o_orderkey + 23         as o23,
. . . . . . . . . . . . >                 o_orderkey + 24         as o24,
. . . . . . . . . . . . >                 o_orderkey + 25         as o25,
. . . . . . . . . . . . >                 o_orderkey + 26         as o26,
. . . . . . . . . . . . >                 o_orderkey + 27         as o27,
. . . . . . . . . . . . >                 o_orderkey + 28         as o28,
. . . . . . . . . . . . >                 o_orderkey + 29         as o29,
. . . . . . . . . . . . >                 o_orderkey + 30         as o30,
. . . . . . . . . . . . >                 o_orderkey + 31         as o31,
. . . . . . . . . . . . >                 o_orderkey + 32         as o32,
. . . . . . . . . . . . >                 o_orderkey + 33         as o33,
. . . . . . . . . . . . >                 o_orderkey + 34         as o34,
. . . . . . . . . . . . >                 o_orderkey + 35         as o35,
. . . . . . . . . . . . >                 o_orderkey + 36         as o36,
. . . . . . . . . . . . >                 o_orderkey + 37         as o37,
. . . . . . . . . . . . >                 o_orderkey + 38         as o38,
. . . . . . . . . . . . >                 o_orderkey + 39         as o39,
. . . . . . . . . . . . >                 o_orderkey + 40         as o40
. . . . . . . . . . . . >         from
. . . . . . . . . . . . >                 dfs.`/drill/testdata/Tpch0.01/json/orders/vicky.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey)
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17], o1=[$18], o2=[$19], o3=[$20], o4=[$21], o5=[$22], o6=[$23], o7=[$24], o8=[$25], o9=[$26], o10=[$27], o11=[$28], o12=[$29], o13=[$30], o14=[$31], o15=[$32], o16=[$33], o17=[$34], o18=[$35], o19=[$36], o20=[$37], o21=[$38], o22=[$39], o23=[$40], o24=[$41], o25=[$42], o26=[$43], o27=[$44], o28=[$45], o29=[$46], o30=[$47], o31=[$48], o32=[$49], o33=[$50], o34=[$51], o35=[$52], o36=[$53], o37=[$54], o38=[$55], o39=[$56], o40=[$57])
00-02        Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17], o1=[$18], o2=[$19], o3=[$20], o4=[$21], o5=[$22], o6=[$23], o7=[$24], o8=[$25], o9=[$26], o10=[$27], o11=[$28], o12=[$29], o13=[$30], o14=[$31], o15=[$32], o16=[$33], o17=[$34], o18=[$35], o19=[$36], o20=[$37], o21=[$38], o22=[$39], o23=[$40], o24=[$41], o25=[$42], o26=[$43], o27=[$44], o28=[$45], o29=[$46], o30=[$47], o31=[$48], o32=[$49], o33=[$50], o34=[$51], o35=[$52], o36=[$53], o37=[$54], o38=[$55], o39=[$56], o40=[$57])
00-03          HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05            Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04            SelectionVectorRemover
00-06              Filter(condition=[AND(=($0, 1), =($1, 22))])
00-07                Flatten(flattenField=[$0])
00-08                  Project(test=[$1], orderkey=[$0], o1=[+($0, 1)], o2=[+($0, 2)], o3=[+($0, 3)], o4=[+($0, 4)], o5=[+($0, 5)], o6=[+($0, 6)], o7=[+($0, 7)], o8=[+($0, 8)], o9=[+($0, 9)], o10=[+($0, 10)], o11=[+($0, 11)], o12=[+($0, 12)], o13=[+($0, 13)], o14=[+($0, 14)], o15=[+($0, 15)], o16=[+($0, 16)], o17=[+($0, 17)], o18=[+($0, 18)], o19=[+($0, 19)], o20=[+($0, 20)], o21=[+($0, 21)], o22=[+($0, 22)], o23=[+($0, 23)], o24=[+($0, 24)], o25=[+($0, 25)], o26=[+($0, 26)], o27=[+($0, 27)], o28=[+($0, 28)], o29=[+($0, 29)], o30=[+($0, 30)], o31=[+($0, 31)], o32=[+($0, 32)], o33=[+($0, 33)], o34=[+($0, 34)], o35=[+($0, 35)], o36=[+($0, 36)], o37=[+($0, 37)], o38=[+($0, 38)], o39=[+($0, 39)], o40=[+($0, 40)])
00-09                    Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/vicky.json, numFiles=1, columns=[`test`, `o_orderkey`], files=[maprfs:///drill/testdata/Tpch0.01/json/orders/vicky.json]]])
{code}

Filter pushdown is triggered when the size of the table is increased to 300+ rows:
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . >         *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         hive.lineitem_text_hive l
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         ( select
. . . . . . . . . . . . >                 flatten(test)           as test,
. . . . . . . . . . . . >                 o_orderkey              as orderkey
. . . . . . . . . . . . > from
. . . . . . . . . . . . >                 dfs.`/drill/testdata/Tpch0.01/json/orders/orders_1.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey )
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-02        Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2], l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6], l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10], l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13], l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-03          HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05            Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04            SelectionVectorRemover
00-06              Filter(condition=[=($0, 1)])
00-07                Flatten(flattenField=[$0])
00-08                  Project(test=[$0], orderkey=[$1])
00-09                    SelectionVectorRemover
00-10                      Filter(condition=[=($1, 22)])
00-11                        Project(test=[$1], o_orderkey=[$0])
00-12                          Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/orders_1.json, numFiles=1, columns=[`test`, `o_orderkey`], files=[maprfs:///drill/testdata/Tpch0.01/json/orders/orders_1.json]]])
{code}
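The contrast between the 2-row and 300+-row cases suggests the decision is driven by estimated savings rather than by the shape of the query. As a toy model only (the actual Calcite/Drill cost comparison is more involved than a fixed cutoff, and `min_saved_rows` is an invented threshold for illustration):

```python
# Toy model of why a tiny input may not trigger partial pushdown while a
# larger one does: the estimated benefit of filtering early scales with
# the input row count. Not Drill's real cost model.

def estimated_savings(input_rows, filter_selectivity):
    """Rows a pushed-down filter would eliminate before the Flatten."""
    return input_rows * (1.0 - filter_selectivity)

def worth_pushing(input_rows, filter_selectivity, min_saved_rows=50):
    # min_saved_rows is a made-up cutoff; the real planner compares
    # full alternative plan costs rather than a fixed threshold.
    return estimated_savings(input_rows, filter_selectivity) > min_saved_rows

print(worth_pushing(2, 0.15))    # 2-row vicky.json: savings negligible
print(worth_pushing(300, 0.15))  # 300+-row orders_1.json: pushdown pays off
```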

I will add this test case to the test suite. It is different from Jinfeng's unit tests: he is testing with directory labels, which exercises a slightly different code path.
In my opinion, we should revisit test coverage in this area after we switch to the HEP planner (DRILL-3996). I will file a separate enhancement for that.

Consider this fixed and verified in:
{code}
#Generated by Git-Commit-Id-Plugin
#Sat Feb 13 00:36:18 UTC 2016
git.commit.id.abbrev=0a2518d
git.commit.user.email=jni@apache.org
git.commit.message.full=DRILL-4363\: Row count based pruning for parquet table used in Limit n query.\n\nModify two existint unit testcase\:\n1) TestPartitionFilter.testMainQueryFalseCondition()\: rowCount pruning applied after false condition is transformed into LIMIT 0\n2) TestLimitWithExchanges.testPushLimitPastUnionExchange()\: modify the testcase to use Json source, so that it does not mix with PushLimitIntoScanRule.\n
git.commit.id=0a2518d7cf01a92a27a82e29edac5424bedf31d5
git.commit.message.short=DRILL-4363\: Row count based pruning for parquet table used in Limit n query.
git.commit.user.name=Jinfeng Ni
git.build.user.name=vmarkman
git.commit.id.describe=0.9.0-602-g0a2518d
git.build.user.email=vmarkman@maprtech.com
git.branch=0a2518d7cf01a92a27a82e29edac5424bedf31d5
git.commit.time=11.02.2016 @ 23\:01\:15 UTC
git.build.time=13.02.2016 @ 00\:36\:18 UTC
git.remote.origin.url=git@github.com\:apache/drill.git
{code}

> DrillPushFilterPastProject should allow partial filter pushdown. 
> -----------------------------------------------------------------
>
>                 Key: DRILL-4201
>                 URL: https://issues.apache.org/jira/browse/DRILL-4201
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.5.0
>
>
> Currently, DrillPushFilterPastProjectRule will stop pushing the filter down if the filter itself has an ITEM or FLATTEN function, or if its input reference refers to an ITEM or FLATTEN function. However, in the case where the filter is a conjunction of multiple sub-filters, some of which refer to ITEM or FLATTEN and others do not, we should allow the partial filter to be pushed down. For instance,
> WHERE  partition_col > 10 and flatten_output_col = 'ABC'. 
> The "flatten_output_col" comes from the output of FLATTEN operator, and therefore flatten_output_col = 'ABC' should not pushed past the project. But partiion_col > 10 should be pushed down, such that we could trigger the pruning rule to apply partition pruning.
> It would improve Drill query performance when the partially pushed filter leads to partition pruning, or when the partially pushed filter results in early filtering in the upstream operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)