You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2015/03/06 22:50:38 UTC
[jira] [Commented] (DRILL-2400) Drill's query planner did not merge
multiple filters into one.
[ https://issues.apache.org/jira/browse/DRILL-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350970#comment-14350970 ]
Jinfeng Ni commented on DRILL-2400:
-----------------------------------
[~amansinha100], could you please review the patch for DRILL-2240, in particular, the first part where the estimation of filter's cpu cost is modified? Thanks.
> Drill's query planner did not merge multiple filters into one.
> ---------------------------------------------------------------
>
> Key: DRILL-2400
> URL: https://issues.apache.org/jira/browse/DRILL-2400
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Attachments: 0005-DRILL-2240-Part-1-Change-cpu-cost-estimation-formula.patch, 0006-DRILL-2240-Part-2-Add-merge-filter-rule-to-logical-r.patch
>
>
> Let's say we have a view defined with a WHERE clause as the following:
> {code}
> "create or replace view MyViewWithFilter as " +
> " SELECT first_name, " +
> " last_name, " +
> " full_name, " +
> " salary, " +
> " employee_id, " +
> " store_id, " +
> " position_id, " +
> " position_title, " +
> " education_level " +
> " FROM cp.`employee.json` " +
> " WHERE position_id in (1, 2, 3 ) "
> {code}
> For the following query which refers to the above view in two places, one in out query, one in the subquery, each reference adds an additional filter,
> {code}
> select dat.store_id
> , sum(dat.store_cost) as total_cost
> from (
> select store_id,
> position_id ,
> sum( salary) as store_cost
> from MyViewWithFilter
> where full_name in ( select n_name
> from cp.`tpch/nation.parquet`)
> and education_level = 'GRADUATE DEGREE'
> and position_id in ( select position_id
> from MyViewWithFilter
> where position_title like '%VP%'
> )
> group by store_id, position_id
> ) dat
> group by dat.store_id
> order by dat.store_id
> {code}
> The current plan will be like the following :
> {code}
> 00-00 Screen
> 00-01 Project(store_id=[$0], total_cost=[$1])
> 00-02 SelectionVectorRemover
> 00-03 Sort(sort0=[$0], dir0=[ASC])
> 00-04 HashAgg(group=[{0}], total_cost=[SUM($1)])
> 00-05 Project(store_id=[$0], store_cost=[$2])
> 00-06 HashAgg(group=[{0, 1}], store_cost=[SUM($2)])
> 00-07 Project(store_id=[$1], position_id=[$2], $f3=[$0])
> 00-08 HashJoin(condition=[=($4, $5)], joinType=[inner])
> 00-10 Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], $f11=[$2])
> 00-12 HashJoin(condition=[=($4, $5)], joinType=[inner])
> 00-15 SelectionVectorRemover
> 00-18 Filter(condition=[=($3, 'GRADUATE DEGREE')])
> 00-20 Project($f3=[CAST($2):ANY NOT NULL], $f5=[CAST($3):ANY NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY NOT NULL], $f9=[CAST($1):ANY NOT NULL])
> 00-22 SelectionVectorRemover
> 00-23 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> 00-24 Project(position_id=[$0], full_name=[$4], salary=[$1], store_id=[$2], education_level=[$3])
> 00-25 Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `full_name`, `salary`, `store_id`, `education_level`], files=[/employee.json]]])
> 00-14 HashAgg(group=[{0}])
> 00-17 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 00-09 HashAgg(group=[{0}])
> 00-11 Project(position_id=[CAST($0):ANY NOT NULL])
> 00-13 SelectionVectorRemover
> 00-16 Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
> 00-19 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> 00-21 Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `position_title`], files=[/employee.json]]])
> {code}
> In the plan, the View's own filter and the additional filter were not merged, and we could see either in the out query or subquery, there are two filters, like :
> {code}
> 00-16 Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
> 00-19 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> {code}
> It is desired to have filter merged, because
> 1) it could improve the possibility of partition pruning with the merged filter, since the partition pruning will match a filter directly on top of SCAN.
> Filter (dir0 = ' ....')
> \
> Filter ( colA = '...')
> \
> SCAN
> If the filter with (dir0) is not merged with the second filter, then partition pruning will not kick in
> 2) Merged filter would allow short circuit evaluation, which could mean better performance, compared with the case where we evaluate the filter one by one.
> There are two reasons why filters are not merged:
> 1) Drill's planner does not enable the rule of MergerFilter rule
> 2) Even if we enable MergeFilter rule, in some situation, we still could not get filter merged, because the cpu cost estimation for filter is a bit off the real cost.
> In this JIRA, we'll
> 1) Enable MergeFilter (Drill's version)
> 2) Fix the cost estimation formula for Filter operator.
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)