You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Harish Butani (JIRA)" <ji...@apache.org> on 2013/11/22 21:55:42 UTC

[jira] [Created] (HIVE-5873) SubQuery: In subquery Count Bug

Harish Butani created HIVE-5873:
-----------------------------------

             Summary: SubQuery: In subquery Count Bug
                 Key: HIVE-5873
                 URL: https://issues.apache.org/jira/browse/HIVE-5873
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Harish Butani


This is from the Optimization of Nested SQl Queries Revisited paper: http://dl.acm.org/citation.cfm?id=38723

Consider Part table having:
{noformat}
PNum OrderOnHand
-------- ------------------
3          6
10        1
8          0
{noformat}

Supply table having:
{noformat}
PNum  Qty  
3          4
3          2
10        1
{noformat}

The query:
{noformat}
select pnum
from parts p
where orderOnHand
     in (select count(*) from supply s
          where s.pnum = p.pnum
         )
{noformat}

should return the row with PNum=8.
But a transformation to a semi-join would eliminate this row, as there are no rows in supply table with PNum=8.

AS shown in the paper the soln is to transform to:
{noformat}
select pnum
from parts p semijoin
        (select p1.pnum, count(*) as c
          from (select distinct pnum from parts) p1 join supply s
          where s.pnum = p1.pnum
         ) sq on p.pnum = sq.pnum and p.orderOnHand = sq.c
{noformat}

The additional distinct query within the SubQuery is to handle duplicates in the outer query on the joining columns.



--
This message was sent by Atlassian JIRA
(v6.1#6144)