You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2015/03/06 22:50:38 UTC
[jira] [Commented] (DRILL-2400) Drill's query planner did not merge multiple filters into one.

    [ https://issues.apache.org/jira/browse/DRILL-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350970#comment-14350970 ] 

Jinfeng Ni commented on DRILL-2400:
-----------------------------------

[~amansinha100], could you please review the patch for DRILL-2240, in particular, the first part where the estimation of filter's cpu cost is modified? Thanks.



> Drill's query planner did not merge multiple filters into one. 
> ---------------------------------------------------------------
>
>                 Key: DRILL-2400
>                 URL: https://issues.apache.org/jira/browse/DRILL-2400
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>         Attachments: 0005-DRILL-2240-Part-1-Change-cpu-cost-estimation-formula.patch, 0006-DRILL-2240-Part-2-Add-merge-filter-rule-to-logical-r.patch
>
>
> Let's say we have a view defined with a WHERE clause as the following:
> {code}
> "create or replace view MyViewWithFilter as " +
>         " SELECT  first_name, " +
>         "         last_name, " +
>         "         full_name, " +
>         "         salary, " +
>         "         employee_id, " +
>         "         store_id, " +
>         "         position_id, " +
>         "         position_title, " +
>         "         education_level " +
>         " FROM cp.`employee.json` " +
>         " WHERE position_id in (1, 2, 3 ) "
> {code}
> For the following query which refers to the above view in two places, one in out query, one in the subquery, each reference adds an additional filter,
> {code}
> select dat.store_id
>       , sum(dat.store_cost) as total_cost
>  from (
>        select store_id,  
>                   position_id , 
>                   sum( salary) as store_cost
>        from MyViewWithFilter 
>        where full_name in ( select n_name
>                                        from cp.`tpch/nation.parquet`)
>            and  education_level = 'GRADUATE DEGREE'   
>            and position_id in ( select position_id 
>                                           from MyViewWithFilter
>                                           where position_title like '%VP%'
>                                         )
>   group by store_id, position_id
> ) dat
> group by dat.store_id
> order by dat.store_id
> {code}
> The current plan will be like the following :
> {code}
> 00-00    Screen
> 00-01      Project(store_id=[$0], total_cost=[$1])
> 00-02        SelectionVectorRemover
> 00-03          Sort(sort0=[$0], dir0=[ASC])
> 00-04            HashAgg(group=[{0}], total_cost=[SUM($1)])
> 00-05              Project(store_id=[$0], store_cost=[$2])
> 00-06                HashAgg(group=[{0, 1}], store_cost=[SUM($2)])
> 00-07                  Project(store_id=[$1], position_id=[$2], $f3=[$0])
> 00-08                    HashJoin(condition=[=($4, $5)], joinType=[inner])
> 00-10                      Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], $f11=[$2])
> 00-12                        HashJoin(condition=[=($4, $5)], joinType=[inner])
> 00-15                          SelectionVectorRemover
> 00-18                            Filter(condition=[=($3, 'GRADUATE DEGREE')])
> 00-20                              Project($f3=[CAST($2):ANY NOT NULL], $f5=[CAST($3):ANY NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY NOT NULL], $f9=[CAST($1):ANY NOT NULL])
> 00-22                                SelectionVectorRemover
> 00-23                                  Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> 00-24                                    Project(position_id=[$0], full_name=[$4], salary=[$1], store_id=[$2], education_level=[$3])
> 00-25                                      Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `full_name`, `salary`, `store_id`, `education_level`], files=[/employee.json]]])
> 00-14                          HashAgg(group=[{0}])
> 00-17                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 00-09                      HashAgg(group=[{0}])
> 00-11                        Project(position_id=[CAST($0):ANY NOT NULL])
> 00-13                          SelectionVectorRemover
> 00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
> 00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> 00-21                                Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `position_title`], files=[/employee.json]]])
> {code}
> In the plan,  the View's own filter and the additional filter were not merged, and we could see either in the out query or subquery, there are two filters, like :
> {code}
> 00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
> 00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
> {code}
> It is desired to have filter merged, because 
> 1) it could improve the possibility of partition pruning with the merged filter, since the partition pruning will match a filter directly on top of SCAN.
>    Filter   (dir0 = ' ....') 
>       \     
>    Filter   ( colA = '...')
>           \
>        SCAN
> If the filter with (dir0) is not merged with the second filter, then partition pruning will not kick in
> 2) Merged filter would allow short circuit evaluation, which could mean better performance, compared with the case where we evaluate the filter one by one.
> There are two reasons why filters are not merged:
> 1) Drill's planner does not enable the rule of MergerFilter rule
> 2) Even if we enable MergeFilter rule, in some situation, we still could not get filter merged, because the cpu cost estimation for filter is a bit off the real cost.
> In this JIRA, we'll 
> 1) Enable MergeFilter (Drill's version)
> 2) Fix the cost estimation formula for Filter operator.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)