Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/08/19 07:30:46 UTC

[jira] [Resolved] (SPARK-9952) Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

     [ https://issues.apache.org/jira/browse/SPARK-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9952.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

> Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
> --------------------------------------------------------------------------
>
>                 Key: SPARK-9952
>                 URL: https://issues.apache.org/jira/browse/SPARK-9952
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 1.5.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> In Scala, Seq.fill always returns a List. Accessing a List element by index is an O(N) operation, so the following code is very slow (~10 seconds on my machine):
> {code}
> val numItems = 100000
> val s = Seq.fill(numItems)(1)
> for (i <- 0 until numItems) s(i)
> {code}
> It turns out that we had a loop like this in the DAGScheduler code. In getPreferredLocsInternal, there's a call to {{getCacheLocs(rdd)(partition)}}. The {{getCacheLocs}} call returns a Seq. If this Seq is a List and the RDD has many partitions, then indexing into this list costs O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations, we implicitly perform an N^2 loop, which reduces scheduling throughput.
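> As a simplified sketch of that access pattern (hypothetical names, not the actual DAGScheduler code), each per-partition lookup walks a List from its head, so the pass over all partitions is quadratic:
> {code}
> // Simplified illustration only; cachedLocations stands in for the Seq
> // returned by getCacheLocs(rdd), which is a List in the slow case.
> val numPartitions = 100000
> val cachedLocations: Seq[String] = Seq.fill(numPartitions)("hostA")
> // One lookup per partition; each cachedLocations(p) is O(p) on a List,
> // so computing all preferred locations is O(numPartitions^2).
> val preferredLocs = (0 until numPartitions).map(p => cachedLocations(p))
> {code}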
> We can easily fix this by replacing Seq with Array in this code.
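> As a rough sketch of that fix (assuming we only swap the backing collection, as described above), the same micro-benchmark with an Array finishes almost immediately because Array indexing is O(1):
> {code}
> // Hedged sketch of the proposed change, not the actual patch:
> // an Array (constant-time apply) replaces the List-backed Seq.
> val numItems = 100000
> val s = Array.fill(numItems)(1)
> for (i <- 0 until numItems) s(i)  // each lookup is O(1), so the loop is O(N)
> {code}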



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
