
Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/05/08 21:54:59 UTC

[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile

    [ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535386#comment-14535386 ] 

Josh Rosen commented on SPARK-7410:
-----------------------------------

We should confirm this, but if I recall correctly, the reason we have to broadcast these separately has something to do with configuration mutability or thread-safety.  Based on a quick glance at SPARK-2585, it looks like I tried folding this into the RDD broadcast, but this caused performance issues for RDDs with huge numbers of tasks.  If you're interested in fixing this, I'd take a closer look through that old JIRA to figure out whether its discussion is still relevant.

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> --------------------------------------------------------------------
>
>                 Key: SPARK-7410
>                 URL: https://issues.apache.org/jira/browse/SPARK-7410
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together.  Certain details of the way the data is stored require this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action.  I dug into why this takes so long and it looks like the overhead of broadcasting the Hadoop configuration is taking up most of the time.  In this case, the broadcasting isn't helpful because each HadoopRDD only corresponds to one or two tasks.  When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x.
> It would be nice if there were a way to turn this broadcasting off, either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile.
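
For readers unfamiliar with the pattern in the description, here is a minimal sketch of it, assuming PySpark's SparkContext.newAPIHadoopFile and SparkContext.union. The helper name, the path list and the default input format and key/value classes are illustrative assumptions and do not come from the report, which most likely used the Scala API. According to the report, each RDD created this way pays the cost of broadcasting the Hadoop configuration, so the loop below is where the time would go when there are thousands of paths.

```python
def build_unioned_rdd(
    sc,
    paths,
    input_format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    key_class="org.apache.hadoop.io.LongWritable",
    value_class="org.apache.hadoop.io.Text",
):
    """Create one Hadoop-backed RDD per path and union them into a single RDD.

    `sc` is expected to behave like a pyspark.SparkContext. The helper name
    and the default class names are assumptions made for illustration only.
    """
    if not paths:
        raise ValueError("paths must not be empty")

    # One newAPIHadoopFile call per path. Per the report, every call
    # broadcasts the Hadoop configuration, even if the resulting RDD
    # only corresponds to one or two tasks.
    rdds = [
        sc.newAPIHadoopFile(path, input_format, key_class, value_class)
        for path in paths
    ]

    # Combine the per-path RDDs into one RDD.
    return sc.union(rdds)
```

With a real SparkContext this would be called as build_unioned_rdd(sc, list_of_paths). No action runs at this point, which matches the observation that the delay shows up before any of the RDDs is used.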



