You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2015/03/06 22:38:38 UTC
[jira] [Created] (DRILL-2400) Drill's query planner did not merge multiple filters into one.

Jinfeng Ni created DRILL-2400:
---------------------------------

             Summary: Drill's query planner did not merge multiple filters into one. 
                 Key: DRILL-2400
                 URL: https://issues.apache.org/jira/browse/DRILL-2400
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni


Let's say we have a view defined with a WHERE clause as the following:

{code}
"create or replace view MyViewWithFilter as " +
        " SELECT  first_name, " +
        "         last_name, " +
        "         full_name, " +
        "         salary, " +
        "         employee_id, " +
        "         store_id, " +
        "         position_id, " +
        "         position_title, " +
        "         education_level " +
        " FROM cp.`employee.json` " +
        " WHERE position_id in (1, 2, 3 ) "
{code}

For the following query which refers to the above view in two places, one in out query, one in the subquery, each reference adds an additional filter,

{code}
select dat.store_id
      , sum(dat.store_cost) as total_cost
 from (
       select store_id,  
                  position_id , 
                  sum( salary) as store_cost
       from MyViewWithFilter 
       where full_name in ( select n_name
                                       from cp.`tpch/nation.parquet`)
           and  education_level = 'GRADUATE DEGREE'   
           and position_id in ( select position_id 
                                          from MyViewWithFilter
                                          where position_title like '%VP%'
                                        )
  group by store_id, position_id
) dat
group by dat.store_id
order by dat.store_id
{code}

The current plan will be like the following :
{code}
00-00    Screen
00-01      Project(store_id=[$0], total_cost=[$1])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            HashAgg(group=[{0}], total_cost=[SUM($1)])
00-05              Project(store_id=[$0], store_cost=[$2])
00-06                HashAgg(group=[{0, 1}], store_cost=[SUM($2)])
00-07                  Project(store_id=[$1], position_id=[$2], $f3=[$0])
00-08                    HashJoin(condition=[=($4, $5)], joinType=[inner])
00-10                      Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], $f11=[$2])
00-12                        HashJoin(condition=[=($4, $5)], joinType=[inner])
00-15                          SelectionVectorRemover
00-18                            Filter(condition=[=($3, 'GRADUATE DEGREE')])
00-20                              Project($f3=[CAST($2):ANY NOT NULL], $f5=[CAST($3):ANY NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY NOT NULL], $f9=[CAST($1):ANY NOT NULL])
00-22                                SelectionVectorRemover
00-23                                  Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
00-24                                    Project(position_id=[$0], full_name=[$4], salary=[$1], store_id=[$2], education_level=[$3])
00-25                                      Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `full_name`, `salary`, `store_id`, `education_level`], files=[/employee.json]]])
00-14                          HashAgg(group=[{0}])
00-17                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
00-09                      HashAgg(group=[{0}])
00-11                        Project(position_id=[CAST($0):ANY NOT NULL])
00-13                          SelectionVectorRemover
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
00-21                                Scan(groupscan=[EasyGroupScan [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `position_title`], files=[/employee.json]]])
{code}

In the plan,  the View's own filter and the additional filter were not merged, and we could see either in the out query or subquery, there are two filters, like :
{code}
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, '%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3))])
{code}

It is desired to have filter merged, because 
1) it could improve the possibility of partition pruning with the merged filter, since the partition pruning will match a filter directly on top of SCAN.
   Filter   (dir0 = ' ....') 
      \     
   Filter   ( colA = '...')
          \
       SCAN
If the filter with (dir0) is not merged with the second filter, then partition pruning will not kick in

2) Merged filter would allow short circuit evaluation, which could mean better performance, compared with the case where we evaluate the filter one by one.

There are two reasons why filters are not merged:
1) Drill's planner does not enable the rule of MergerFilter rule
2) Even if we enable MergeFilter rule, in some situation, we still could not get filter merged, because the cpu cost estimation for filter is a bit off the real cost.

In this JIRA, we'll 

1) Enable MergeFilter (Drill's version)
2) Fix the cost estimation formula for Filter operator.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)