You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2020/10/26 21:54:00 UTC

[jira] [Created] (IMPALA-10287) Distribution strategy is sub-optimal for certain queries

Aman Sinha created IMPALA-10287:
-----------------------------------

             Summary: Distribution strategy is sub-optimal for certain queries
                 Key: IMPALA-10287
                 URL: https://issues.apache.org/jira/browse/IMPALA-10287
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.4.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha


I ran a simplified query (extracted from q78 of TPC-DS) on a 600GB dataset on an 8 node cluster. I forced the distribution strategy for the left outer join and compared Broadcast vs Hash Partition for different values of mt_dop.  The example query and results are shown below (elapsed times are in seconds):

{noformat}
Query (with shuffle or broadcast hint):
select count(*)
   from store_sales
   left join [shuffle] store_returns on sr_ticket_number=ss_ticket_number 
         and ss_item_sk=sr_item_sk
   join date_dim on ss_sold_date_sk = d_date_sk
   where sr_ticket_number is null
   and d_year=2002;
{noformat}

||mt_dop||Broadcast||Partition||
|1|45|15|
|2|37|9|
|4|33|5|
|8|31|4|
|12|31|4|

Given the nearly 7.5x speedup for partition distribution at mt_dop = 12 (which is the default), it indicates that the cost formula comparing the broadcast vs partition needs to be modified to take into account the mt_dop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org