Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:18:19 UTC
[jira] [Resolved] (SPARK-19222) Limit Query Performance issue
[ https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-19222.
----------------------------------
Resolution: Incomplete
> Limit Query Performance issue
> -----------------------------
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: Linux/Windows
> Reporter: Sujith Chacko
> Priority: Minor
> Labels: bulk-closed
>
> When a limit is added in the middle of the physical plan there is a
> possibility of a memory bottleneck: if the limit value is very large, the
> system will try to aggregate all of the per-partition limit results into a
> single partition.
> Example:
> create table src_temp as select * from src limit n; (n=10000000)
> == Physical Plan ==
> ExecutedCommand
>    +- CreateHiveTableAsSelectCommand [Database: spark, TableName: t2, InsertIntoHiveTable]
>          +- GlobalLimit 10000000
>             +- LocalLimit 10000000
>                +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
>                   +- SubqueryAlias hive
>                      +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv
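>
> One way to reproduce and inspect this plan from the spark-shell (a minimal
> sketch; "src" is assumed to be an existing table and the column names above
> are illustrative):
>
> // Sketch only: EXPLAIN on the CTAS prints the plan, including where the
> // LocalLimit and GlobalLimit operators appear, without actually creating
> // the table.
> spark.sql("EXPLAIN CREATE TABLE src_temp AS SELECT * FROM src LIMIT 10000000")
>   .show(truncate = false)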
> As shown in the plan above, when the limit appears in the middle of the plan
> there can be two kinds of performance bottleneck:
> scenario 1: the partition count is very high and the limit value is small
> scenario 2: the limit value is very large
> E.g., with a limit value of 10000000 and a partition count of 5:
>
> LocalLimit  --------> |partition 1| |partition 2| |partition 3| |partition 4| |partition 5|
>             --------> <<take n>>    <<take n>>    <<take n>>    <<take n>>    <<take n>>
>                            |
>              Shuffle Exchange (into a single partition)
>                            |
> GlobalLimit --------> <<take n>>  (data from all partitions is grouped into a single partition)
>
> As shown above, the system shuffles the per-partition limit results from all
> partitions into a single partition before applying the global limit, which
> induces the performance bottleneck.
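>
> One possible workaround sketch (not part of this issue's resolution): take
> the first n rows via zipWithIndex so rows keep their original partitioning
> instead of being shuffled into a single partition. The function and table
> names below are illustrative.
>
> import org.apache.spark.sql.DataFrame
>
> // Returns the first n rows without collapsing everything into one partition.
> // zipWithIndex runs one extra job to compute per-partition offsets, but the
> // filtered result keeps the original partitioning.
> def limitWithoutSinglePartition(df: DataFrame, n: Long): DataFrame = {
>   val limited = df.rdd
>     .zipWithIndex()
>     .filter { case (_, idx) => idx < n }
>     .map { case (row, _) => row }
>   df.sparkSession.createDataFrame(limited, df.schema)
> }
>
> // Usage (illustrative): write the result instead of CREATE TABLE ... LIMIT n
> // limitWithoutSinglePartition(spark.table("src"), 10000000L)
> //   .write.saveAsTable("src_temp")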
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)