Posted to issues@spark.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2014/11/07 06:37:34 UTC

[jira] [Comment Edited] (SPARK-4290) Provide an equivalent functionality of distributed cache as MR does

    [ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201621#comment-14201621 ] 

Xuefu Zhang edited comment on SPARK-4290 at 11/7/14 5:37 AM:
-------------------------------------------------------------

Yes, SparkContext#addFile() seems to be what we need. If the files can be more efficiently broadcast to every executor, that's even better than distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

To clarify, [~sandyr], [~rxin]: are files added via SparkContext.addFile() automatically downloaded to each executor, or does SparkFiles.get() have to be called to make that happen?
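
For reference, a minimal PySpark sketch of the usage pattern in question: the driver registers a file with SparkContext.addFile(), and task code resolves the local copy with SparkFiles.get(). The helper names read_shipped_file and run_job, and the resolve parameter, are hypothetical and only for illustration. The sketch does not by itself show at which point the file is downloaded to the executor.

```python
import os
import tempfile


def read_shipped_file(name, resolve=None):
    """Task side: open the local copy of a file registered via addFile()."""
    if resolve is None:
        # Imported lazily so the lookup happens where the task runs.
        from pyspark import SparkFiles
        resolve = SparkFiles.get
    # `resolve` (hypothetical hook) maps a file name to a local path;
    # by default this is SparkFiles.get().
    with open(resolve(name)) as f:
        return f.read().strip()


def run_job(sc, path):
    """Driver side: register `path`, then read it from every task."""
    sc.addFile(path)  # register the file with the SparkContext
    name = os.path.basename(path)
    return (sc.parallelize(range(4), 2)
              .map(lambda _: read_shipped_file(name))
              .collect())
```

With an existing SparkContext sc, run_job(sc, "/some/path/lookup.txt") would return the file's contents once per element, assuming the path is readable from the driver.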


was (Author: xuefuz):
Yes, SparkContext#addFile() seems to be what we need. If the files can be more efficiently broadcast to every executor, that's even better than distributed cache. In the meantime, we can set a large replication factor for the files to mitigate the problem.

> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
>                 Key: SPARK-4290
>                 URL: https://issues.apache.org/jira/browse/SPARK-4290
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed cache for a job, and the framework guarantees that the files will be available in the local file system of any node where a task of the job runs, before the task actually starts. While this might be achieved with YARN via hacks, it's not available in other clusters. It would be nice to have equivalent functionality in Spark.
> It would also complement Spark's broadcast variable, which may not be suitable in certain scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org