You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Jesus Camacho Rodriguez (JIRA)" <ji...@apache.org> on 2017/01/16 11:07:26 UTC

[jira] [Created] (HIVE-15636) Hive/Druid integration: wrong semantics of topN query limit with granularity

Jesus Camacho Rodriguez created HIVE-15636:
----------------------------------------------

             Summary: Hive/Druid integration: wrong semantics of topN query limit with granularity
                 Key: HIVE-15636
                 URL: https://issues.apache.org/jira/browse/HIVE-15636
             Project: Hive
          Issue Type: Bug
          Components: Druid integration
    Affects Versions: 2.2.0
            Reporter: Jesus Camacho Rodriguez
            Assignee: Jesus Camacho Rodriguez
            Priority: Critical


Semantics of Druid topN query with limit and granularity is not equivalent to input SQL. In particular, limit is applied on each granularity value, not on the overall query.

Currently, the following query will be transformed into a topN query:
{code:sql}
SELECT i_brand_id, floor_day(`__time`), max(ss_quantity), sum(ss_wholesale_cost) as s
FROM store_sales_sold_time_subset
GROUP BY i_brand_id, floor_day(`__time`)
ORDER BY s DESC
LIMIT 10;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Map 1 vectorized
      File Output Operator [FS_4]
        Select Operator [SEL_3] (rows=15888 width=0)
          Output:["_col0","_col1","_col2","_col3"]
          TableScan [TS_0] (rows=15888 width=0)
            tpcds_druid_10@store_sales_sold_time_subset,store_sales_sold_time_subset,Tbl:PARTIAL,Col:NONE,Output:["i_brand_id","__time","$f2","$f3"],properties:{"druid.query.json":"{\"queryType\":\"topN\",\"dataSource\":\"druid_tpcds_ss_sold_time_subset\",\"granularity\":\"DAY\",\"dimension\":\"i_brand_id\",\"metric\":\"$f3\",\"aggregations\":[{\"type\":\"longMax\",\"name\":\"$f2\",\"fieldName\":\"ss_quantity\"},{\"type\":\"doubleSum\",\"name\":\"$f3\",\"fieldName\":\"ss_wholesale_cost\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"threshold\":10}","druid.query.type":"topN"}
{code}

It outputs 300 rows, 10 per day. In turn, the equivalent SQL query for a Druid topN query should be expressed as:
{code:sql}
SELECT rs.i_brand_id, rs.d, rs.m, rs.s
FROM (
    SELECT i_brand_id, floor_day(`__time`) as d, max(ss_quantity) as m, sum(ss_wholesale_cost) as s,
           ROW_NUMBER() OVER (PARTITION BY floor_day(`__time`) ORDER BY sum(ss_wholesale_cost) DESC ) AS rownum
    FROM store_sales_sold_time_subset
    GROUP BY i_brand_id, floor_day(`__time`)
) rs
WHERE rownum <= 10;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)