Posted to dev@hawq.apache.org by "Eleanor Lewis (JIRA)" <ji...@apache.org> on 2019/07/18 19:38:00 UTC
[jira] [Created] (HAWQ-1727) discrepancy between results obtained with dynamic joins that should be identical
Eleanor Lewis created HAWQ-1727:
-----------------------------------
Summary: discrepancy between results obtained with dynamic joins that should be identical
Key: HAWQ-1727
URL: https://issues.apache.org/jira/browse/HAWQ-1727
Project: Apache HAWQ
Issue Type: Bug
Reporter: Eleanor Lewis
Assignee: Radar Lei
The results of the following two queries should be identical; the second result (count = 11202) is correct. It appears that the filtering that (select distinct on ...) should apply is not applied unless the subquery is wrapped in (select * from ...).
{code:sql}
test_refactor=> select count(*) from tcga_sample_16 where (patient_id, entry_date) in (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
-[ RECORD 1 ]
count | 41546

test_refactor=> select count(*) from tcga_sample_16 where (patient_id, entry_date) in (select * from (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc) as tcga_temp);
-[ RECORD 1 ]
count | 11202
{code}
This error occurred on version 2.0.0 of the HAWQ database.
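For reference, the semantics the subquery should have can be sketched with a hypothetical in-memory model (plain Python, not HAWQ code; the column names are reused from the queries above and the toy data is invented): DISTINCT ON (patient_id) with ORDER BY patient_id, entry_date DESC keeps only the latest entry_date per patient, so the outer count must exclude older rows.

```python
# Hypothetical in-memory model of the intended semantics (not HAWQ source code).
# Mirrors: select count(*) from t where (patient_id, entry_date) in
#          (select distinct on (patient_id) patient_id, entry_date
#           from t order by patient_id, entry_date desc);

def distinct_on_latest(rows):
    """DISTINCT ON (patient_id) ... ORDER BY patient_id, entry_date DESC:
    keep, per patient_id, the pair with the greatest entry_date."""
    latest = {}
    for patient_id, entry_date in rows:
        if patient_id not in latest or entry_date > latest[patient_id]:
            latest[patient_id] = entry_date
    return set(latest.items())

def count_matching(rows):
    """Outer query: count table rows whose (patient_id, entry_date)
    is in the set kept by the DISTINCT ON subquery."""
    keep = distinct_on_latest(rows)
    return sum(1 for row in rows if row in keep)

# Toy data (invented): two patients, three rows.
rows = [("p1", "2019-01-01"), ("p1", "2019-02-01"), ("p2", "2019-01-15")]
# Only the latest row per patient survives DISTINCT ON, so the count is 2, not 3.
```

Under this model, the first HAWQ query above (which returns the full table count of 41546) appears to behave as if the DISTINCT ON filter were dropped entirely.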
The EXPLAIN and EXPLAIN ANALYZE output:
{code:sql}
test_refactor=> explain select count(*) from tcga_sample_16 where (patient_id, entry_date) in
(select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3275.02..3275.03 rows=1 width=8)
-> Gather Motion 1:1 (slice2; segments: 1) (cost=3274.97..3275.00 rows=1 width=8)
-> Aggregate (cost=3274.97..3274.98 rows=1 width=8)
-> Hash EXISTS Join (cost=2260.57..3274.93 rows=16 width=0)
Hash Cond: public.tcga_sample_16.patient_id = public.tcga_sample_16.patient_id AND public.tcga_sample_16.entry_date = public.tcga_sample_16.entry_date
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
-> Hash (cost=1637.38..1637.38 rows=41546 width=21)
-> Broadcast Motion 1:1 (slice1; segments: 1) (cost=0.00..1637.38 rows=41546 width=21)
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Settings: default_hash_table_bucket_number=78
Optimizer status: legacy query optimizer
(11 rows)
test_refactor=> explain select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select * from
test_refactor(> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc)
test_refactor(> as tcga_temp);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6545.81..6545.82 rows=1 width=8)
-> Gather Motion 1:1 (slice3; segments: 1) (cost=6545.77..6545.80 rows=1 width=8)
-> Aggregate (cost=6545.77..6545.78 rows=1 width=8)
-> Hash Join (cost=5531.38..6545.72 rows=16 width=0)
Hash Cond: public.tcga_sample_16.patient_id = tcga_temp.patient_id AND public.tcga_sample_16.entry_date = tcga_temp.entry_date
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
-> Hash (cost=5364.98..5364.98 rows=11093 width=40)
-> Broadcast Motion 1:1 (slice2; segments: 1) (cost=4824.46..5364.98 rows=11093 width=40)
-> Unique (cost=4824.46..5032.19 rows=11093 width=21)
Group By: public.tcga_sample_16.patient_id
-> Sort (cost=4824.46..4928.33 rows=41546 width=21)
Sort Key (Distinct): public.tcga_sample_16.patient_id, public.tcga_sample_16.entry_date
-> Redistribute Motion 1:1 (slice1; segments: 1) (cost=0.00..1637.38 rows=41546 width=21)
Hash Key: public.tcga_sample_16.patient_id
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Settings: default_hash_table_bucket_number=78
Optimizer status: legacy query optimizer
(17 rows)
test_refactor=> explain analyze select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3275.02..3275.03 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 274/274 ms to end, start offset by 0.858/0.858 ms.
-> Gather Motion 1:1 (slice2; segments: 1) (cost=3274.97..3275.00 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers at destination. Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 274/274 ms to end, start offset by 0.858/0.858 ms.
-> Aggregate (cost=3274.97..3274.98 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 1/1 rows with 272/272 ms to end, start offset by 2.389/2.389 ms.
-> Hash EXISTS Join (cost=2260.57..3274.93 rows=16 width=0)
Hash Cond: public.tcga_sample_16.patient_id = public.tcga_sample_16.patient_id AND public.tcga_sample_16.entry_date = public.tcga_sample_16.entry_date
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 137/137 ms to first row, 269/269 ms to end, start offset by 2.389/2.389 ms.
Executor memory: 12289K bytes.
Work_mem used: 1948K bytes. Workfile: (0 spilling, 0 reused)
Hash chain length 1.0 avg, 3 max, using 39968 of 524341 buckets.
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 6.499/6.499 ms to first row, 118/118 ms to end, start offset by 133/133 ms.
-> Hash (cost=1637.38..1637.38 rows=41546 width=21)
Rows in: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 127/127 ms to end, start offset by 6.183/6.183 ms.
-> Broadcast Motion 1:1 (slice1; segments: 1) (cost=0.00..1637.38 rows=41546 width=21)
Rows out: Avg 41546.0 rows x 1 workers at destination. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 0.036/0.036 ms to first row, 115/115 ms to end, start offset by 6.183/6.183 ms.
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.695/1.695 ms to first row, 117/117 ms to end, start offset by 2.389/2.389 ms.
Slice statistics:
(slice0) Executor memory: 241K bytes.
(slice1) Executor memory: 367K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).
(slice2) Executor memory: 12660K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca). Work_mem: 1948K bytes max.
Statement statistics:
Memory used: 262144K bytes
Settings: default_hash_table_bucket_number=78
Optimizer status: legacy query optimizer
Dispatcher statistics:
executors used(total/cached/new connection): (2/2/0); dispatcher time(total/connection/dispatch data): (0.133 ms/0.000 ms/0.024 ms).
dispatch data time(max/min/avg): (0.015 ms/0.008 ms/0.011 ms); consume executor data time(max/min/avg): (0.022 ms/0.015 ms/0.018 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
Data locality statistics:
data locality ratio: 0.230; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (12809136.000 B/12809136 B/12809136 B); segment size with penalty(avg/min/max): (12907764.000 B/12907764 B/12907764 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 0.158 ms; resource allocation: 3.245 ms; datalocality calculation: 0.084 ms.
Total runtime: 278.581 ms
(34 rows)
test_refactor=> explain analyze select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select * from
test_refactor(> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc)
test_refactor(> as tcga_temp);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6545.81..6545.82 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 291/291 ms to end, start offset by 4.487/4.487 ms.
-> Gather Motion 1:1 (slice3; segments: 1) (cost=6545.77..6545.80 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers at destination. Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 291/291 ms to end, start offset by 4.488/4.488 ms.
-> Aggregate (cost=6545.77..6545.78 rows=1 width=8)
Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 1/1 rows with 290/290 ms to end, start offset by 5.942/5.942 ms.
-> Hash Join (cost=5531.38..6545.72 rows=16 width=0)
Hash Cond: public.tcga_sample_16.patient_id = tcga_temp.patient_id AND public.tcga_sample_16.entry_date = tcga_temp.entry_date
Rows out: Avg 11202.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 168/168 ms to first row, 289/289 ms to end, start offset by 5.943/5.943 ms.
Executor memory: 3073K bytes.
Work_mem used: 526K bytes. Workfile: (0 spilling, 0 reused)
Hash chain length 1.0 avg, 4 max, using 10741 of 131111 buckets.
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.162/1.162 ms to first row, 112/112 ms to end, start offset by 173/173 ms.
-> Hash (cost=5364.98..5364.98 rows=11093 width=40)
Rows in: Avg 11202.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 166/166 ms to end, start offset by 6.478/6.478 ms.
-> Broadcast Motion 1:1 (slice2; segments: 1) (cost=4824.46..5364.98 rows=11093 width=40)
Rows out: Avg 11202.0 rows x 1 workers at destination. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 156/156 ms to first row, 164/164 ms to end, start offset by 6.478/6.478 ms.
-> Unique (cost=4824.46..5032.19 rows=11093 width=21)
Group By: public.tcga_sample_16.patient_id
Rows out: Avg 11202.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 155/155 ms to first row, 163/163 ms to end, start offset by 7.333/7.333 ms.
-> Sort (cost=4824.46..4928.33 rows=41546 width=21)
Sort Key (Distinct): public.tcga_sample_16.patient_id, public.tcga_sample_16.entry_date
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 155/155 ms to first row, 158/158 ms to end, start offset by 7.333/7.333 ms.
Executor memory: 9721K bytes.
Work_mem used: 9721K bytes. Workfile: (0 spilling, 0 reused)
-> Redistribute Motion 1:1 (slice1; segments: 1) (cost=0.00..1637.38 rows=41546 width=21)
Hash Key: public.tcga_sample_16.patient_id
Rows out: Avg 41546.0 rows x 1 workers at destination. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 0.036/0.036 ms to first row, 118/118 ms to end, start offset by 7.632/7.632 ms.
-> Append-only Scan on tcga_sample_16 (cost=0.00..806.46 rows=41546 width=21)
Rows out: Avg 41546.0 rows x 1 workers. Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.157/1.157 ms to first row, 116/116 ms to end, start offset by 6.143/6.143 ms.
Slice statistics:
(slice0) Executor memory: 242K bytes.
(slice1) Executor memory: 371K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).
(slice2) Executor memory: 9971K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca). Work_mem: 9721K bytes max.
(slice3) Executor memory: 3444K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca). Work_mem: 526K bytes max.
Statement statistics:
Memory used: 262144K bytes
Settings: default_hash_table_bucket_number=78
Optimizer status: legacy query optimizer
Dispatcher statistics:
executors used(total/cached/new connection): (3/2/1); dispatcher time(total/connection/dispatch data): (3.760 ms/1563297976670.666 ms/0.052 ms).
dispatch data time(max/min/avg): (0.030 ms/0.009 ms/0.017 ms); consume executor data time(max/min/avg): (0.016 ms/0.010 ms/0.013 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
Data locality statistics:
data locality ratio: 0.230; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (12809136.000 B/12809136 B/12809136 B); segment size with penalty(avg/min/max): (12907764.000 B/12907764 B/12907764 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 0.156 ms; resource allocation: 3.217 ms; datalocality calculation: 0.095 ms.
Total runtime: 299.443 ms
(46 rows)
test_refactor=>
{code}
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)