You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Andries Engelbrecht (JIRA)" <ji...@apache.org> on 2018/08/23 20:30:00 UTC
[jira] [Created] (DRILL-6708) Flatten operator executes twice on subquery resulting in cartesian of flatten columns, when final query has 2 columns using the flatten column in original query

Andries Engelbrecht created DRILL-6708:
------------------------------------------

             Summary: Flatten operator executes twice on subquery resulting in cartesian of flatten columns, when final query has 2 columns using the flatten column in original query
                 Key: DRILL-6708
                 URL: https://issues.apache.org/jira/browse/DRILL-6708
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning &amp; Optimization
    Affects Versions: 1.13.0
            Reporter: Andries Engelbrecht
         Attachments: campaignclicks_50.json

The following query with subquery and referencing the flatten column twice in final result, ends up with 1137195 rows vs the expected 140913 rows.

 
{code:java}
      SELECT
( `Third`.`cust_id`),
( `Third`.`device`),
( `Third`.`prod_id`)
( `Third`.`prod_id`) AS `prod_id2`
FROM (
 SELECT
 ( `Second`.`cust_id`),
 ( `Second`.`device`),
 ( `Second`.`prod_id`)
 FROM
 ( SELECT
 ( `First`.`cust_id`),
 ( `First`.`device`),
 ( `First`.`prod_id`)
 FROM
 `dfs.views`.`clicks_campaign_vw` AS `First` 
 ) AS `Second`
) AS `Third`        {code}
 
This executed against Drill View listed below
 
{code:java}
CREATE or REPLACE VIEW dfs.views.clicks_campaign_vw AS
SELECT CAST(`t`.`trans_id` as BIGINT) as trans_id, CAST(`t`.`date` AS DATE) AS `thedate`,
CAST(`t`.`user_info`['cust_id'] AS BIGINT) AS `cust_id`,
CAST(`t`.`user_info`['device'] AS VARCHAR(20)) AS `device`,
CAST(`t`.`user_info`['state'] AS VARCHAR(2)) AS `custstate`,
CAST(FLATTEN(`t`.`trans_info`['prod_id']) AS BIGINT) AS `prod_id`,
CAST(`t`.`trans_info`['purch_flag'] AS VARCHAR(6)) AS `purch_flag`
FROM `dfs`.`clicks`.`campaignclicks_50.json` AS `t`
WHERE `t`.`trans_info`['prod_id'][0] IS NOT NULL;{code}
Below is the query plan showing FLATTEN invoked twice
 
{code:java}
00-00 Screen : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {73965.03 rows, 337056.82999999996 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901702 00-01 ComplexToJson : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {73256.1 rows, 336347.89999999997 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901701 00-02 Project(cust_id=[CAST($0):BIGINT], device=[CAST($1):VARCHAR(20) CHARACTER SET "UTF-16LE" COLLATE "UTF-16LE$en_US$primary"], prod_id=[CAST($3):BIGINT], prod_id2=[CAST($4):BIGINT]) : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {66166.8 rows, 329258.6 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901700 00-03 Flatten(flattenField=[$4]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4, ANY EXPR$5): rowcount = 7089.3, cumulative cost = {59077.50000000001 rows, 215829.8 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901699 00-04 Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$4=[$3], EXPR$5=[$2]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4, ANY EXPR$5): rowcount = 7089.3, cumulative cost = {51988.200000000004 rows, 208740.5 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901698 00-05 Flatten(flattenField=[$3]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4): rowcount = 7089.3, cumulative cost = {44898.9 rows, 173294.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901697 00-06 Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$4=[$2]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4): rowcount = 7089.3, cumulative cost = {37809.6 rows, 166204.7 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901696 00-07 SelectionVectorRemover : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7089.3, cumulative cost = {30720.3 rows, 137847.5 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901695 00-08 Filter(condition=[IS NOT NULL($3)]) : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7089.3, cumulative cost = {23631.0 rows, 130758.2 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901694 00-09 Project(ITEM=[ITEM($0, 'cust_id')], ITEM1=[ITEM($0, 'device')], ITEM2=[ITEM($1, 'prod_id')], ITEM3=[ITEM(ITEM($1, 'prod_id'), 0)]) : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7877.0, cumulative cost = {15754.0 rows, 70893.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901693 00-10 Scan(table=[[dfs, clicks, campaignclicks_50.json]], groupscan=[EasyGroupScan [selectionRoot=maprfs:/data/nested/clicks/campaignclicks_50.json, numFiles=1, columns=[`user_info`.`cust_id`, `user_info`.`device`, `trans_info`.`prod_id`, `trans_info`.`prod_id`[0]], files=[maprfs:///data/nested/clicks/campaignclicks_50.json]]]) : rowType = RecordType(ANY user_info, ANY trans_info): rowcount = 7877.0, cumulative cost = {7877.0 rows, 15754.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901692{code}
 
The source data file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)