You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "zengrui (Jira)" <ji...@apache.org> on 2021/08/12 10:50:00 UTC
[jira] [Created] (SPARK-36494) SortMergeJoin do unnecessary shuffle
for tables whose provider is hive
zengrui created SPARK-36494:
-------------------------------
Summary: SortMergeJoin do unnecessary shuffle for tables whose provider is hive
Key: SPARK-36494
URL: https://issues.apache.org/jira/browse/SPARK-36494
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.1
Reporter: zengrui
Create two tables in spark like this:
{code:java}
CREATE TABLE catalog_sales
(
cs_sold_date_sk BIGINT,
cs_sold_time_sk BIGINT,
cs_ship_date_sk BIGINT,
cs_bill_customer_sk BIGINT,
cs_bill_cdemo_sk BIGINT,
cs_bill_hdemo_sk BIGINT,
cs_bill_addr_sk BIGINT,
cs_ship_customer_sk BIGINT,
cs_ship_cdemo_sk BIGINT,
cs_ship_hdemo_sk BIGINT,
cs_ship_addr_sk BIGINT,
cs_call_center_sk BIGINT,
cs_catalog_page_sk BIGINT,
cs_ship_mode_sk BIGINT,
cs_warehouse_sk BIGINT,
cs_item_sk BIGINT,
cs_promo_sk BIGINT,
cs_order_number BIGINT,
cs_quantity BIGINT,
cs_wholesale_cost FLOAT,
cs_list_price FLOAT,
cs_sales_price FLOAT,
cs_ext_discount_amt FLOAT,
cs_ext_sales_price FLOAT,
cs_ext_wholesale_cost FLOAT,
cs_ext_list_price FLOAT,
cs_ext_tax FLOAT,
cs_coupon_amt FLOAT,
cs_ext_ship_cost FLOAT,
cs_net_paid FLOAT,
cs_net_paid_inc_tax FLOAT,
cs_net_paid_inc_ship FLOAT,
cs_net_paid_inc_ship_tax FLOAT,
cs_net_profit FLOAT
)
clustered by (cs_item_sk) into 10 buckets
stored as ORC;CREATE TABLE store_sales
(
ss_sold_date_sk BIGINT,
ss_sold_time_sk BIGINT,
ss_item_sk BIGINT,
ss_customer_sk BIGINT,
ss_cdemo_sk BIGINT,
ss_hdemo_sk BIGINT,
ss_addr_sk BIGINT,
ss_store_sk BIGINT,
ss_promo_sk BIGINT,
ss_ticket_number BIGINT,
ss_quantity BIGINT,
ss_wholesale_cost FLOAT,
ss_list_price FLOAT,
ss_sales_price FLOAT,
ss_ext_discount_amt FLOAT,
ss_ext_sales_price FLOAT,
ss_ext_wholesale_cost FLOAT,
ss_ext_list_price FLOAT,
ss_ext_tax FLOAT,
ss_coupon_amt FLOAT,
ss_net_paid FLOAT,
ss_net_paid_inc_tax FLOAT,
ss_net_profit FLOAT
)
CLUSTERED BY (ss_item_sk) INTO 10 buckets
stored as ORC;{code}
Then the table's provider is hive, when execute sql:
{code:java}
SELECT * FROM tpcds_orc_mintest.store_sales a join tpcds_orc_mintest.catalog_sales b on a.ss_item_sk = b.cs_item_sk limit 1000;
{code}
Spark do unnecessary shuffle before SortMergeJoin.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org