Posted to issues@spark.apache.org by "Taylor Smock (Jira)" <ji...@apache.org> on 2020/10/12 14:19:00 UTC

[jira] [Created] (SPARK-33120) Lazy Load of SparkContext.addFiles

Taylor Smock created SPARK-33120:
------------------------------------

             Summary: Lazy Load of SparkContext.addFiles
                 Key: SPARK-33120
                 URL: https://issues.apache.org/jira/browse/SPARK-33120
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.1
         Environment: Mac OS X (2 systems), workload to eventually be run on Amazon EMR.

Java 11 application.
            Reporter: Taylor Smock


In my Spark job, I may have various files that may or may not be used by each task.

I would like to avoid copying all of the files to every executor; ideally a file would only be fetched once it is actually needed.

What I've tried:
 * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were distributed to all executors (see the first sketch after this list).
 * Broadcast variables. Since I _don't_ know which files I am going to need until I have started the task, I have to broadcast all of the data at once, which leads to every node getting the data and then caching it to disk. In short, the same issue as SparkContext.addFiles, but with the added benefit of being able to create a mapping of paths to files (see the second sketch after this list).
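
For reference, a minimal sketch of the first approach (the file names and paths below are made up for illustration): files are registered on the driver and resolved inside a task, and with the current behavior every executor fetches every registered file whether or not its tasks read them:

{code:java}
import java.util.Arrays;

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class AddFileEagerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("addFile-eager-distribution")
        .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Register the files with the driver. With the current behavior every
    // executor fetches every registered file, whether or not its tasks
    // actually read them.
    jsc.addFile("/data/lookup-a.bin"); // hypothetical paths
    jsc.addFile("/data/lookup-b.bin");

    jsc.parallelize(Arrays.asList("a", "b")).foreach(key -> {
      // Resolve the local copy on the executor; this particular task only
      // needs one of the two files, but both were already shipped.
      String localPath = SparkFiles.get("lookup-" + key + ".bin");
      // ... open and use localPath ...
    });

    spark.stop();
  }
}
{code}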
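
And a sketch of the broadcast workaround, again with hypothetical names: because the files a task needs are not known up front, every file's contents have to go into the broadcast, so every executor still receives (and caches) all of the data:

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

public class BroadcastAllFilesExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("broadcast-file-contents")
        .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Read and broadcast everything up front, since we cannot know which
    // entries a task will need until it is already running.
    Map<String, byte[]> contents = new HashMap<>();
    contents.put("lookup-a.bin", Files.readAllBytes(Paths.get("/data/lookup-a.bin")));
    contents.put("lookup-b.bin", Files.readAllBytes(Paths.get("/data/lookup-b.bin")));
    Broadcast<Map<String, byte[]>> files = jsc.broadcast(contents);

    jsc.parallelize(Arrays.asList("a", "b")).foreach(key -> {
      // The task picks out only the entry it needs, but the whole map was
      // already shipped to (and cached on) every executor.
      byte[] data = files.value().get("lookup-" + key + ".bin");
      // ... use data ...
    });

    spark.stop();
  }
}
{code}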

What I would like to see:
 * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, Enum.WaitForAvailability), or Future<?> future = SparkFiles.get(file) (a hypothetical sketch follows)
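
To make the request concrete, a purely hypothetical sketch of what that could look like from user code; none of these overloads or enums exists in Spark today, and the names are only placeholders:

{code:java}
// Hypothetical API only -- these overloads/enums do not exist in Spark today.

// Registration marks the file as lazily distributed instead of being pushed
// to every executor up front.
sc.addFile("/data/lookup-a.bin", DistributionMode.LAZY_LOAD);

// Inside a task, the executor fetches the file on first use, blocking until
// the local copy is available ...
String localPath = SparkFiles.get("lookup-a.bin", Availability.WAIT_FOR_AVAILABILITY);

// ... or (the Future<?> variant from above) handing back a future so the task
// can overlap the fetch with other work.
Future<String> futurePath = SparkFiles.getAsync("lookup-a.bin"); // placeholder name
{code}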


Notes: https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 indicated that `SparkFiles.get` would be required to get the data on the local driver, but in my testing that did not appear to be the case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org