You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2014/08/26 01:04:58 UTC

[jira] [Created] (DRILL-1337) TPCH Q13 may return incorrect rows : Drill may incorrectly pull up a local right filter in a left outer join condition.

Jinfeng Ni created DRILL-1337:
---------------------------------

             Summary: TPCH Q13 may return incorrect rows :  Drill may incorrectly pull up a local right filter in a left outer join condition.
                 Key: DRILL-1337
                 URL: https://issues.apache.org/jira/browse/DRILL-1337
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni


For TPCH Q13, Drill may discard some qualified rows, since Drill pulls up a local RHS filter in a left outer join, and put it on top of Join. There means the local RHS is treated as a post-join condition, which will incorrectly discard some qualified rows.

select
  c_count,
  count(*) as custdist
from
  (
    select
      c.c_custkey,
      count(o.o_orderkey)
    from
      cp.`tpch/customer.parquet` c 
      left outer join cp.`tpch/orders.parquet` o 
        on c.c_custkey = o.o_custkey
        and o.o_comment not like '%special%requests%'
    group by
      c.c_custkey
  ) as orders (c_custkey, c_count)
group by
  c_count
order by
  custdist desc,
  c_count desc;


Drill Physical : 
.......................
02-06                              Filter(condition=[$3]): rowcount = 3750.0, cumulative cost = {79500.0 rows, 568512.0 cpu, 0.0 io, 1.90464E8 network, 264000.0 memory}, id = 2649
02-07                                HashJoin(condition=[=($0, $1)], joinType=[left]): rowcount = 15000.0, cumulative cost = {64500.0 rows, 508512.0 cpu, 0.0 io, 1.90464E8 network, 264000.0 memory}, id = 2648
02-09                                  HashToRandomExchange(dist0=[[$0]]): rowcount = 1500.0, cumulative cost = {3000.0 rows, 25500.0 cpu, 0.0 io, 6144000.0 network, 0.0 memory}, id = 2644
03-01                                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/customer.parquet]], selectionRoot=/tpch/customer.parquet, columns=[SchemaPath [`c_custkey`]]]]): rowcount = 1500.0, cumulative cost = {1500.0 rows, 1500.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2643
02-08                                  HashToRandomExchange(dist0=[[$0]]): rowcount = 15000.0, cumulative cost = {45000.0 rows, 285012.0 cpu, 0.0 io, 1.8432E8 network, 0.0 memory}, id = 2647
04-01                                    Project(o_custkey=[$1], o_orderkey=[$0], $f4=[NOT(LIKE($2, '%special%requests%'))]): rowcount = 15000.0, cumulative cost = {30000.0 rows, 45012.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2646
04-02                                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/orders.parquet]], selectionRoot=/tpch/orders.parquet, columns=[SchemaPath [`o_custkey`], SchemaPath [`o_orderkey`], SchemaPath [`o_comment`]]]]): rowcount = 15000.0, cumulative cost = {15000.0 rows, 45000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2645




--
This message was sent by Atlassian JIRA
(v6.2#6252)