Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/10/06 20:00:33 UTC

[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

    [ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160589#comment-14160589 ] 

Josh Rosen commented on SPARK-2585:
-----------------------------------

I tried benchmarking the time needed to create a new JobConf() object, and it looks like each one takes ~2.3 milliseconds:

{code}
import org.apache.hadoop.mapred.JobConf

object HadoopConfBenchmark {
  def main(args: Array[String]): Unit = {
    val numConfs = 10000
    val start = System.currentTimeMillis()
    // Each new JobConf() re-reads the default Hadoop configuration resources.
    for (i <- 1 to numConfs) {
      new JobConf()
    }
    val end = System.currentTimeMillis()
    println(s"Took ${end - start} ms to create $numConfs new JobConfs")
  }
}
{code}

On my laptop, this outputs:

{code}
Took 23492 ms to create 10000 new JobConfs
{code}

Since the correlation optimizer tests ran ~7 seconds slower with this PR, the slowdown could be explained if those tests were running ~3000 tasks (3000 tasks x ~2.3 ms ~= 7 seconds).  This is actually plausible, since the default parallelism was pretty high in those tests (~200 partitions, if I recall) and the queries were very complicated.
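As a quick back-of-the-envelope check, here is the arithmetic behind that estimate in plain Scala (the 23492 ms / 10000 confs figures come from the benchmark above; the 3000-task count is just my hypothesis about those tests):

```scala
object SlowdownEstimate {
  val totalMs = 23492.0                          // measured: time to create 10000 JobConfs
  val numConfs = 10000
  val perConfMs = totalMs / numConfs             // per-JobConf cost, ~2.35 ms
  val tasks = 3000                               // hypothesized task count for the tests
  val projectedSec = perConfMs * tasks / 1000.0  // projected extra time, ~7.0 s

  def main(args: Array[String]): Unit =
    println(f"~$perConfMs%.2f ms per JobConf; ~$projectedSec%.1f s over $tasks tasks")
}
```

So ~3000 tasks at ~2.35 ms each lands almost exactly on the observed ~7 second regression.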

For most real deployments (i.e. not running in local mode), the extra ~2 ms per task will probably be masked by other latencies (e.g. RPC), so I'd say that we should merge this patch for now and try to regain the performance elsewhere if it turns out to be a problem.

There's the option of putting this behind a configuration option, but I don't like that approach: I think it's important to be "correct by default" and not to offer options that sacrifice correctness for performance.

> Remove special handling of Hadoop JobConf
> -----------------------------------------
>
>                 Key: SPARK-2585
>                 URL: https://issues.apache.org/jira/browse/SPARK-2585
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Josh Rosen
>            Priority: Critical
>
> This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the implementation does not use shared conf objects). We no longer need to specially broadcast the Hadoop configuration since we are broadcasting RDD data anyways.


