Posted to issues@spark.apache.org by "KaiXinXIaoLei (JIRA)" <ji...@apache.org> on 2018/02/25 08:32:03 UTC

[jira] [Comment Edited] (SPARK-23405) The task will hang up when a small table left semi join a big table

    [ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375985#comment-16375985 ] 

KaiXinXIaoLei edited comment on SPARK-23405 at 2/25/18 8:32 AM:
----------------------------------------------------------------

I ran `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The Optimized Logical Plan is:

== Optimized Logical Plan ==
Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
:- Project [cs_order_number#1]
:  +- Filter isnotnull(cs_order_number#1)
:     +- MetastoreRelation 100t, ls
+- Project [cs_order_number#22]
   +- Relation[cs_sold_date_sk#5,cs_ext_sales_price#28,... 10 more fields] parquet

 

 

I think the Optimized Logical Plan should be:

== Optimized Logical Plan ==
Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
:- Project [cs_order_number#1]
:  +- Filter isnotnull(cs_order_number#1)
:     +- MetastoreRelation 100t, ls
+- Project [cs_order_number#22]
   {color:#ff0000}+- Filter isnotnull(cs_order_number#22){color}
      +- Relation[cs_sold_date_sk#5,cs_ext_sales_price#28,... 10 more fields] parquet
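
The suggested plan relies on the fact that an equality join condition is never true when either key is NULL, so dropping NULL keys from the right side before the join cannot change the result of a left semi join. The following is a minimal Python sketch of that reasoning only. It is not Spark code: the helper names (`sql_eq`, `left_semi_join`) and the sample key values are hypothetical, and it says nothing about whether the missing filter is what makes the task hang.

```python
from typing import Iterable, List, Optional


def sql_eq(a: Optional[int], b: Optional[int]) -> bool:
    """SQL-style equality as a join condition: a NULL (None) on either side never matches."""
    return a is not None and b is not None and a == b


def left_semi_join(left: Iterable[Optional[int]],
                   right: Iterable[Optional[int]]) -> List[Optional[int]]:
    """Keep each left key that matches at least one right key (naive nested loop)."""
    right_keys = list(right)
    return [l for l in left if any(sql_eq(l, r) for r in right_keys)]


# Hypothetical stand-ins for ls.cs_order_number and catalog_sales.cs_order_number.
ls_keys = [7]
cs_keys = [None, 3, None, 7, None]

# Join against the raw right side, then against the right side with NULL keys removed.
unfiltered = left_semi_join(ls_keys, cs_keys)
filtered = left_semi_join(ls_keys, [k for k in cs_keys if k is not None])

# The two results are the same, so the extra isnotnull filter is safe to add.
print(unfiltered == filtered)
```

Because the result is unchanged either way, the filter would only reduce the number of right-side rows that reach the join.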

 


was (Author: kaixinxiaolei):
I ran `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The Optimized Logical Plan is:

== Optimized Logical Plan ==
Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
:- Project [cs_order_number#1]
:  +- Filter isnotnull(cs_order_number#1)
:     +- MetastoreRelation 100t, ls
+- Project [cs_order_number#22]
   +- Relation[cs_sold_date_sk#5,cs_sold_time_sk#6,cs_ship_date_sk#7,cs_bill_customer_sk#8,cs_bill_cdemo_sk#9,cs_bill_hdemo_sk#10,cs_bill_addr_sk#11,cs_ship_customer_sk#12,cs_ship_cdemo_sk#13,cs_ship_hdemo_sk#14,cs_ship_addr_sk#15,cs_call_center_sk#16,cs_catalog_page_sk#17,cs_ship_mod

 

 

I think the Optimized Logical Plan should be:

== Optimized Logical Plan ==
Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
:- Project [cs_order_number#1]
:  +- Filter isnotnull(cs_order_number#1)
:     +- MetastoreRelation 100t, ls
+- Project [cs_order_number#22]
   {color:#FF0000}+- Filter isnotnull(cs_order_number#22){color}
      +- Relation[cs_sold_date_sk#5,cs_sold_time_sk#6,cs_ship_date_sk#7,cs_bill_customer_sk#8,cs_bill_cdemo_sk#9,cs_bill_hdemo_sk#10,cs_bill_addr_sk#11,cs_ship_customer_sk#12,cs_ship_cdemo_sk#13,cs_ship_hdemo_sk#14,cs_ship_addr_sk#15,cs_call_

 

> The task will hang up when a small table left semi join a big table
> -------------------------------------------------------------------
>
>                 Key: SPARK-23405
>                 URL: https://issues.apache.org/jira/browse/SPARK-23405
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: KaiXinXIaoLei
>            Priority: Major
>         Attachments: SQL.png, taskhang up.png
>
>
> # I ran the SQL `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is small, with one row. The `catalog_sales` table is large, with 10 billion rows. The task hangs:
> !taskhang up.png!
> The SQL page shows:
> !SQL.png!


