Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/10/12 21:01:00 UTC

[jira] [Comment Edited] (SPARK-33120) Lazy Load of SparkContext.addFiles

    [ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670 ] 

Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM:
------------------------------------------------------------------

Hi, [~tsmock]. What is the benefit you need here?
bq. I would like to avoid copying all of the files to every executor until it is actually needed.


was (Author: dongjoon):
Hi, [~tsmock]. What is the benefit you need here?
> I would like to avoid copying all of the files to every executor until it is actually needed.

> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
>                 Key: SPARK-33120
>                 URL: https://issues.apache.org/jira/browse/SPARK-33120
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>         Environment: Mac OS X (2 systems), workload to eventually be run on Amazon EMR.
> Java 11 application.
>            Reporter: Taylor Smock
>            Priority: Minor
>
> In my Spark job, I may have various files that may or may not be used by each task.
> I would like to avoid copying all of the files to every executor until it is actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were distributed to all executors as soon as they were added.
>  * Broadcast variables. Since I _don't_ know which files I'm going to need until I have started the task, I have to broadcast all the data at once, which leads to every node receiving the data and then caching it to disk. In short, the same issues as SparkContext.addFiles, but with the added benefit of being able to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file)
>  
>  
> Notes: https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 indicated that `SparkFiles.get` would be required to get the data on the local driver, but in my testing that did not appear to be the case.
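The request above amounts to registering a file without copying it, and deferring the copy until first access. A minimal plain-Java sketch of those semantics follows; it involves no Spark at all, and the class and method names (`LazyFiles`, `addFileLazy`, `getAsync`) are hypothetical stand-ins for the proposed `Enum.LazyLoad` / `Enum.WaitForAvailability` variants:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of the requested lazy-load semantics; this is NOT Spark API.
class LazyFiles {
    // Registered fetchers; nothing is copied at registration time.
    private final Map<String, Supplier<String>> fetchers = new ConcurrentHashMap<>();
    // Fetches started on first access, cached so each file is copied at most once.
    private final Map<String, CompletableFuture<String>> started = new ConcurrentHashMap<>();

    // Analogous to the proposed SparkContext.addFiles(file, Enum.LazyLoad): register only.
    void addFileLazy(String name, Supplier<String> fetch) {
        fetchers.put(name, fetch);
    }

    // Analogous to the proposed Future<?> future = SparkFiles.get(file):
    // kick off the copy on first request, return immediately.
    CompletableFuture<String> getAsync(String name) {
        return started.computeIfAbsent(name,
                n -> CompletableFuture.supplyAsync(fetchers.get(n)));
    }

    // Analogous to the proposed SparkFiles.get(file, Enum.WaitForAvailability):
    // block until the file is locally available.
    String get(String name) {
        return getAsync(name).join();
    }
}
```

The key property is that the `Supplier` (the copy from the driver) runs at most once per file and only for files a task actually requests, which is the behavior the reporter could not get from either addFiles or broadcast variables.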



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org