Posted to issues@spark.apache.org by "Josh Rosen (Jira)" <ji...@apache.org> on 2022/08/27 00:20:00 UTC

[jira] [Resolved] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized

     [ https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-40211.
--------------------------------
    Fix Version/s: 3.4.0
         Assignee: Ziqi Liu
       Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/37661] 

> Allow executeTake() / collectLimit's number of starting partitions to be customized
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-40211
>                 URL: https://issues.apache.org/jira/browse/SPARK-40211
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.0
>            Reporter: Ziqi Liu
>            Assignee: Ziqi Liu
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Today, Spark’s executeTake() code allows the limitScaleUpFactor to be customized, but it does not allow the initial number of partitions to be customized: that value is currently hardcoded to {{1}}.
> We should add a configuration so that the initial partition count can be customized. Setting this new configuration to a high value would effectively mitigate the “run multiple jobs” overhead of {{take}}. We could also set it to higher-than-1-but-still-small values (say, {{10}}) to achieve a middle-ground trade-off.
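> For illustration, here is a minimal sketch of how such a conf might be used once it exists. The {{spark.sql.limit.initialNumPartitions}} name below is only an assumption for this sketch and is not defined by this ticket; {{spark.sql.limit.scaleUpFactor}} is the conf that already exists.
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // Hypothetical usage sketch: the initialNumPartitions conf name is assumed here.
> val spark = SparkSession.builder()
>   .master("local[*]")
>   .appName("take-initial-partitions-sketch")
>   // Existing conf: how aggressively the scanned partition count grows between jobs.
>   .config("spark.sql.limit.scaleUpFactor", "4")
>   // Proposed conf (name assumed): start the first take() job with 10 partitions
>   // instead of the hardcoded 1, trading a little extra work for fewer jobs.
>   .config("spark.sql.limit.initialNumPartitions", "10")
>   .getOrCreate()
>
> // With a larger initial partition count, this take() is more likely to finish in one job.
> val rows = spark.range(0, 1000000, 1, 200).toDF("id").take(100)
> {code}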
>  
> Essentially, we need to make {{numPartsToTry = 1L}} ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) customizable. We should do this via a new SQL conf, similar to the {{limitScaleUpFactor}} conf.
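> To make the intent concrete, below is a simplified, self-contained sketch of the take() scale-up loop (local collections rather than RDDs, not the actual SparkPlan code), showing where a configurable initial partition count would replace the hardcoded {{numPartsToTry = 1L}}:
> {code:scala}
> import scala.collection.mutable.ArrayBuffer
>
> // Simplified sketch of the executeTake() scale-up loop. `initialNumPartitions`
> // would come from the proposed SQL conf; `limitScaleUpFactor` already exists.
> // Previously the first pass always scanned exactly one partition.
> def takeSketch[T](partitions: IndexedSeq[Seq[T]], n: Int,
>                   initialNumPartitions: Int, limitScaleUpFactor: Int): Seq[T] = {
>   val buf = ArrayBuffer.empty[T]
>   val totalParts = partitions.length
>   var partsScanned = 0
>   while (buf.length < n && partsScanned < totalParts) {
>     // First pass: scan `initialNumPartitions` partitions instead of the hardcoded 1.
>     var numPartsToTry = initialNumPartitions.toLong
>     if (partsScanned > 0) {
>       if (buf.isEmpty) {
>         // Nothing found yet: grow aggressively by the scale-up factor.
>         numPartsToTry = partsScanned.toLong * limitScaleUpFactor
>       } else {
>         // Estimate how many more partitions are needed, capped by the scale-up factor.
>         val left = n - buf.length
>         numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.length).toLong
>         numPartsToTry = Math.min(numPartsToTry, partsScanned.toLong * limitScaleUpFactor)
>       }
>     }
>     val end = Math.min(partsScanned + numPartsToTry, totalParts.toLong).toInt
>     (partsScanned until end).foreach { i =>
>       buf ++= partitions(i).take(n - buf.length)
>     }
>     partsScanned = end
>   }
>   buf.toSeq
> }
>
> // Example: 100 "partitions" of 5 elements each; take 12 rows starting from 4 partitions.
> val parts = (0 until 100).map(p => Seq.tabulate(5)(i => p * 5 + i))
> println(takeSketch(parts, n = 12, initialNumPartitions = 4, limitScaleUpFactor = 4))
> {code}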
>  
> Spark has several near-duplicate versions of this code ([see code search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
>  * SparkPlan
>  * RDD
>  * pyspark rdd
> Also, {{limitScaleUpFactor}} is not supported in PySpark. For now I will focus on the Scala side first, leaving the Python side untouched, and meanwhile sync with the PySpark maintainers. Depending on progress, we can either do it all in one PR or make the Scala-side change first and leave the PySpark change as a follow-up.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org