Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:18:19 UTC

[jira] [Resolved] (SPARK-19222) Limit Query Performance issue

     [ https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19222.
----------------------------------
    Resolution: Incomplete

> Limit Query Performance issue
> -----------------------------
>
>                 Key: SPARK-19222
>                 URL: https://issues.apache.org/jira/browse/SPARK-19222
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: Linux/Windows
>            Reporter: Sujith Chacko
>            Priority: Minor
>              Labels: bulk-closed
>
> When a limit appears in the middle of the physical plan, there is a 
> possibility of a memory bottleneck: if the limit value is too large, the 
> system will try to aggregate the limited rows from all partitions into a 
> single partition. 
> Description: 
> E.g.: 
> create table src_temp as select * from src limit n;    (n=10000000) 
> == Physical Plan  == 
> ExecutedCommand 
>    +- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, InsertIntoHiveTable] 
>          +- GlobalLimit 10000000 
>             +- LocalLimit 10000000 
>                +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108] 
>                   +- SubqueryAlias hive 
>                      +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv
> As shown in the above plan, when the limit comes in the middle, there can 
> be two types of performance bottleneck: 
> Scenario 1: the partition count is very high and the limit value is small. 
> Scenario 2: the limit value is very large. 
> E.g., the current scenario, based on sample data with a limit count of 10000000 and a partition count of 5: 
> Local Limit  ----> |partition 1| |partition 2| |partition 3| |partition 4| |partition 5|
>                     <<take n>>    <<take n>>    <<take n>>    <<take n>>    <<take n>>
>                                          |
>                          Shuffle Exchange (into single partition)
>                                          |
> Global Limit ---->                  <<take n>>   (all the partition data is grouped into a single partition)
>   
> When this scenario occurs, the system shuffles and groups the limited data 
> from all partitions into a single partition, which induces the performance 
> bottleneck. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org