Posted to dev@hawq.apache.org by "Eleanor Lewis (JIRA)" <ji...@apache.org> on 2019/07/18 19:38:00 UTC

[jira] [Created] (HAWQ-1727) discrepancy between results obtained with dynamic joins that should be identical

Eleanor Lewis created HAWQ-1727:
-----------------------------------

             Summary: discrepancy between results obtained with dynamic joins that should be identical
                 Key: HAWQ-1727
                 URL: https://issues.apache.org/jira/browse/HAWQ-1727
             Project: Apache HAWQ
          Issue Type: Bug
            Reporter: Eleanor Lewis
            Assignee: Radar Lei



The results of the following queries should be identical; the second result is the correct one. It appears that the filtering the (select distinct on ...) subquery should perform is not applied unless the subquery is wrapped in (select * from ...).
{code:sql}
test_refactor=> select count(*) from tcga_sample_16 where (patient_id, entry_date) in (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
-[ RECORD 1 ]
count | 41546

test_refactor=> select count(*) from tcga_sample_16 where (patient_id, entry_date) in (select * from (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc) as tcga_temp);
-[ RECORD 1 ]
count | 11202
{code}
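For reference, a possible workaround sketch (untested; it assumes HAWQ supports the row_number() window function, and the subquery alias "latest" is hypothetical): expressing the latest-entry-per-patient filter as an ordinary WHERE clause may sidestep the faulty subquery pull-up.

{code:sql}
-- Hypothetical workaround sketch (untested): pick the latest entry_date per
-- patient_id with row_number() instead of DISTINCT ON, so the filter survives
-- as an explicit WHERE clause rather than a DISTINCT ON clause that the
-- planner can lose during subquery pull-up.
select count(*)
from tcga_sample_16
where (patient_id, entry_date) in (
    select patient_id, entry_date
    from (select patient_id,
                 entry_date,
                 row_number() over (partition by patient_id
                                    order by entry_date desc) as rn
          from tcga_sample_16) latest
    where rn = 1
);
{code}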
This error occurred on version 2.0.0 of the HAWQ database.

The EXPLAIN and EXPLAIN ANALYZE output:
{code:sql}
test_refactor=> explain select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3275.02..3275.03 rows=1 width=8)
   ->  Gather Motion 1:1  (slice2; segments: 1)  (cost=3274.97..3275.00 rows=1 width=8)
         ->  Aggregate  (cost=3274.97..3274.98 rows=1 width=8)
               ->  Hash EXISTS Join  (cost=2260.57..3274.93 rows=16 width=0)
                     Hash Cond: public.tcga_sample_16.patient_id = public.tcga_sample_16.patient_id AND public.tcga_sample_16.entry_date = public.tcga_sample_16.entry_date
                     ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                     ->  Hash  (cost=1637.38..1637.38 rows=41546 width=21)
                           ->  Broadcast Motion 1:1  (slice1; segments: 1)  (cost=0.00..1637.38 rows=41546 width=21)
                                 ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
 Settings:  default_hash_table_bucket_number=78
 Optimizer status: legacy query optimizer
(11 rows)
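-- Note: in the plan above, the subquery's DISTINCT ON has been flattened away
-- during pull-up. Both inputs of the Hash EXISTS Join are plain scans of
-- tcga_sample_16, and no Unique or Sort node appears anywhere in the plan.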


test_refactor=> explain select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select * from
test_refactor(> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc)
test_refactor(> as tcga_temp);
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6545.81..6545.82 rows=1 width=8)
   ->  Gather Motion 1:1  (slice3; segments: 1)  (cost=6545.77..6545.80 rows=1 width=8)
         ->  Aggregate  (cost=6545.77..6545.78 rows=1 width=8)
               ->  Hash Join  (cost=5531.38..6545.72 rows=16 width=0)
                     Hash Cond: public.tcga_sample_16.patient_id = tcga_temp.patient_id AND public.tcga_sample_16.entry_date = tcga_temp.entry_date
                     ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                     ->  Hash  (cost=5364.98..5364.98 rows=11093 width=40)
                           ->  Broadcast Motion 1:1  (slice2; segments: 1)  (cost=4824.46..5364.98 rows=11093 width=40)
                                 ->  Unique  (cost=4824.46..5032.19 rows=11093 width=21)
                                       Group By: public.tcga_sample_16.patient_id
                                       ->  Sort  (cost=4824.46..4928.33 rows=41546 width=21)
                                             Sort Key (Distinct): public.tcga_sample_16.patient_id, public.tcga_sample_16.entry_date
                                             ->  Redistribute Motion 1:1  (slice1; segments: 1)  (cost=0.00..1637.38 rows=41546 width=21)
                                                   Hash Key: public.tcga_sample_16.patient_id
                                                   ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
 Settings:  default_hash_table_bucket_number=78
 Optimizer status: legacy query optimizer
(17 rows)
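-- Note: with the extra (select * from ...) wrapper, the Unique node
-- (Group By: patient_id) above the Sort is preserved under the Broadcast
-- Motion, so the DISTINCT ON filtering is actually applied in this plan.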



test_refactor=> explain analyze select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc);
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3275.02..3275.03 rows=1 width=8)
   Rows out:  Avg 1.0 rows x 1 workers.  Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 274/274 ms to end, start offset by 0.858/0.858 ms.
   ->  Gather Motion 1:1  (slice2; segments: 1)  (cost=3274.97..3275.00 rows=1 width=8)
         Rows out:  Avg 1.0 rows x 1 workers at destination.  Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 274/274 ms to end, start offset by 0.858/0.858 ms.
         ->  Aggregate  (cost=3274.97..3274.98 rows=1 width=8)
               Rows out:  Avg 1.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 1/1 rows with 272/272 ms to end, start offset by 2.389/2.389 ms.
               ->  Hash EXISTS Join  (cost=2260.57..3274.93 rows=16 width=0)
                     Hash Cond: public.tcga_sample_16.patient_id = public.tcga_sample_16.patient_id AND public.tcga_sample_16.entry_date = public.tcga_sample_16.entry_date
                     Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 137/137 ms to first row, 269/269 ms to end, start offset by 2.389/2.389 ms.
                     Executor memory:  12289K bytes.
                     Work_mem used:  1948K bytes. Workfile: (0 spilling, 0 reused)
                     Hash chain length 1.0 avg, 3 max, using 39968 of 524341 buckets.
                     ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                           Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 6.499/6.499 ms to first row, 118/118 ms to end, start offset by 133/133 ms.
                     ->  Hash  (cost=1637.38..1637.38 rows=41546 width=21)
                           Rows in:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 127/127 ms to end, start offset by 6.183/6.183 ms.
                           ->  Broadcast Motion 1:1  (slice1; segments: 1)  (cost=0.00..1637.38 rows=41546 width=21)
                                 Rows out:  Avg 41546.0 rows x 1 workers at destination.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 0.036/0.036 ms to first row, 115/115 ms to end, start offset by 6.183/6.183 ms.
                                 ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                                       Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.695/1.695 ms to first row, 117/117 ms to end, start offset by 2.389/2.389 ms.
 Slice statistics:
   (slice0)    Executor memory: 241K bytes.
   (slice1)    Executor memory: 367K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).
   (slice2)    Executor memory: 12660K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).  Work_mem: 1948K bytes max.
 Statement statistics:
   Memory used: 262144K bytes
 Settings:  default_hash_table_bucket_number=78
 Optimizer status: legacy query optimizer
 Dispatcher statistics:
   executors used(total/cached/new connection): (2/2/0); dispatcher time(total/connection/dispatch data): (0.133 ms/0.000 ms/0.024 ms).
   dispatch data time(max/min/avg): (0.015 ms/0.008 ms/0.011 ms); consume executor data time(max/min/avg): (0.022 ms/0.015 ms/0.018 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
 Data locality statistics:
   data locality ratio: 0.230; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (12809136.000 B/12809136 B/12809136 B); segment size with penalty(avg/min/max): (12907764.000 B/12907764 B/12907764 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 0.158 ms; resource allocation: 3.245 ms; datalocality calculation: 0.084 ms.
 Total runtime: 278.581 ms
(34 rows)

test_refactor=> explain analyze select count(*) from tcga_sample_16 where (patient_id, entry_date) in
test_refactor-> (select * from
test_refactor(> (select distinct on (patient_id) patient_id, entry_date from tcga_sample_16 order by patient_id, entry_date desc)
test_refactor(> as tcga_temp);
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6545.81..6545.82 rows=1 width=8)
   Rows out:  Avg 1.0 rows x 1 workers.  Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 291/291 ms to end, start offset by 4.487/4.487 ms.
   ->  Gather Motion 1:1  (slice3; segments: 1)  (cost=6545.77..6545.80 rows=1 width=8)
         Rows out:  Avg 1.0 rows x 1 workers at destination.  Max/Last(seg-1:hdp-master02.hadoop.bcgsc.ca/seg-1:hdp-master02.hadoop.bcgsc.ca) 1/1 rows with 291/291 ms to end, start offset by 4.488/4.488 ms.
         ->  Aggregate  (cost=6545.77..6545.78 rows=1 width=8)
               Rows out:  Avg 1.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 1/1 rows with 290/290 ms to end, start offset by 5.942/5.942 ms.
               ->  Hash Join  (cost=5531.38..6545.72 rows=16 width=0)
                     Hash Cond: public.tcga_sample_16.patient_id = tcga_temp.patient_id AND public.tcga_sample_16.entry_date = tcga_temp.entry_date
                     Rows out:  Avg 11202.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 168/168 ms to first row, 289/289 ms to end, start offset by 5.943/5.943 ms.
                     Executor memory:  3073K bytes.
                     Work_mem used:  526K bytes. Workfile: (0 spilling, 0 reused)
                     Hash chain length 1.0 avg, 4 max, using 10741 of 131111 buckets.
                     ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                           Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.162/1.162 ms to first row, 112/112 ms to end, start offset by 173/173 ms.
                     ->  Hash  (cost=5364.98..5364.98 rows=11093 width=40)
                           Rows in:  Avg 11202.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 166/166 ms to end, start offset by 6.478/6.478 ms.
                           ->  Broadcast Motion 1:1  (slice2; segments: 1)  (cost=4824.46..5364.98 rows=11093 width=40)
                                 Rows out:  Avg 11202.0 rows x 1 workers at destination.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 156/156 ms to first row, 164/164 ms to end, start offset by 6.478/6.478 ms.
                                 ->  Unique  (cost=4824.46..5032.19 rows=11093 width=21)
                                       Group By: public.tcga_sample_16.patient_id
                                       Rows out:  Avg 11202.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 11202/11202 rows with 155/155 ms to first row, 163/163 ms to end, start offset by 7.333/7.333 ms.
                                       ->  Sort  (cost=4824.46..4928.33 rows=41546 width=21)
                                             Sort Key (Distinct): public.tcga_sample_16.patient_id, public.tcga_sample_16.entry_date
                                             Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 155/155 ms to first row, 158/158 ms to end, start offset by 7.333/7.333 ms.
                                             Executor memory:  9721K bytes.
                                             Work_mem used:  9721K bytes. Workfile: (0 spilling, 0 reused)
                                             ->  Redistribute Motion 1:1  (slice1; segments: 1)  (cost=0.00..1637.38 rows=41546 width=21)
                                                   Hash Key: public.tcga_sample_16.patient_id
                                                   Rows out:  Avg 41546.0 rows x 1 workers at destination.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 0.036/0.036 ms to first row, 118/118 ms to end, start offset by 7.632/7.632 ms.
                                                   ->  Append-only Scan on tcga_sample_16  (cost=0.00..806.46 rows=41546 width=21)
                                                         Rows out:  Avg 41546.0 rows x 1 workers.  Max/Last(seg0:hdp-slave05.hadoop.bcgsc.ca/seg0:hdp-slave05.hadoop.bcgsc.ca) 41546/41546 rows with 1.157/1.157 ms to first row, 116/116 ms to end, start offset by 6.143/6.143 ms.
 Slice statistics:
   (slice0)    Executor memory: 242K bytes.
   (slice1)    Executor memory: 371K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).
   (slice2)    Executor memory: 9971K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).  Work_mem: 9721K bytes max.
   (slice3)    Executor memory: 3444K bytes (seg0:hdp-slave05.hadoop.bcgsc.ca).  Work_mem: 526K bytes max.
 Statement statistics:
   Memory used: 262144K bytes
 Settings:  default_hash_table_bucket_number=78
 Optimizer status: legacy query optimizer
 Dispatcher statistics:
   executors used(total/cached/new connection): (3/2/1); dispatcher time(total/connection/dispatch data): (3.760 ms/1563297976670.666 ms/0.052 ms).
   dispatch data time(max/min/avg): (0.030 ms/0.009 ms/0.017 ms); consume executor data time(max/min/avg): (0.016 ms/0.010 ms/0.013 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
 Data locality statistics:
   data locality ratio: 0.230; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (12809136.000 B/12809136 B/12809136 B); segment size with penalty(avg/min/max): (12907764.000 B/12907764 B/12907764 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 0.156 ms; resource allocation: 3.217 ms; datalocality calculation: 0.095 ms.
 Total runtime: 299.443 ms
(46 rows)

test_refactor=>
{code}
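The plans account for the discrepancy: in the first plan the subquery has been pulled up into a Hash EXISTS Join whose inner side is just a broadcast scan of tcga_sample_16, with no Unique or Sort node anywhere, so the DISTINCT ON restriction is lost and every (patient_id, entry_date) pair matches itself. A minimal sketch of what that plan effectively computes (the aliases are illustrative):

{code:sql}
-- What the first (incorrect) plan effectively runs: a semi-join of the table
-- against itself with the DISTINCT ON step dropped. Every row matches itself,
-- so the count equals the full row count (41546) instead of the expected 11202.
select count(*)
from tcga_sample_16 t
where exists (
    select 1
    from tcga_sample_16 s
    where s.patient_id = t.patient_id
      and s.entry_date = t.entry_date
);
{code}

Consistent with this, the Unique node in the second plan reports "Rows out: Avg 11202.0 rows", one row per distinct patient_id, which matches the correct count.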



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)