You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2016/07/29 04:55:36 UTC

[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/14396

    [SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file

    ## What changes were proposed in this pull request?
    
    The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.
    
    Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions.
    
    In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.
    
    This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.
    
    I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed.
    
    At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile` it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older v
 ersion)). To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above.
    
    As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task launching thread and written by a driver thread in case the user's driver code is multi-threaded. 
    
    ## How was this patch tested?
    
    I added regression tests in `SparkContextSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-16787

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14396
    
----
commit 99d9855c109233bbeb4a3501041e4ecf4825c278
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-07-29T01:36:24Z

    Add failing regression test.

commit 3ecdb88f52fa21dc3b8c89b27a82a33e6d2d937d
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-07-29T01:36:34Z

    Remove catch case which masked error for addJar.

commit c412f991c145cf02affd7c2d38de016a9b7f548d
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-07-29T01:56:49Z

    Fix bug.

commit b98d1492ec6a6cea4e5a8cca8e1e23fee2c120e9
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-07-29T02:54:34Z

    Add back require but weaken it to only accept identical paths.

commit 0d7dd0d12ace12bf54c3d6b62b802ffa4de800a3
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-07-29T04:32:03Z

    Improve thread-safety; do not update timestamp for file under assumption of immutability.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72837179
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1430,14 +1430,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           schemeCorrectedPath
         }
         val timestamp = System.currentTimeMillis
    -    addedFiles(key) = timestamp
    -
    -    // Fetch the file locally in case a job is executed using DAGScheduler.runLocally().
    -    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
    -      hadoopConfiguration, timestamp, useCache = false)
    -
    -    logInfo("Added file " + path + " at " + key + " with timestamp " + addedFiles(key))
    -    postEnvironmentUpdate()
    +    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
    --- End diff --
    
    Nevermind, I see you're actually changing the behavior from 1.x instead of restoring it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Also merged into branch-2.0 for inclusion in 2.0.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Hi @NYcleaner,
    
    This pull request isn't the right venue for making PySpark API feature requests. Instead, please file a ticket at https://issues.apache.org/jira/browse/SPARK.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    @JoshRosen LGTM for the current patch. However, I have a question about the following description in your PR:
    
    > At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added.
    
    `Utils.fetchFile` will skip the downloading step if a local copy exists. Hence it won't check the content. I think the file content checking is just to prevent from hash code conflicts for caching file names.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by NYcleaner <gi...@git.apache.org>.
Github user NYcleaner commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Thank you


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #63083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63083/consoleFull)** for PR 14396 at commit [`9aa32b3`](https://github.com/apache/spark/commit/9aa32b394cd80c776ac567a662c9471351228b93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    @zsxwing, I meant to describe what happens on executors in the following scenario:
    
    - `addFile(foo)` is called for the first time at `timestamp = 1`
    - A task runs on an executor and downloads the copy of the file added at `timestamp = 1`.
      - By default, the [file fetch cache is enabled](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L432) and filenames in that cache incorporate timestamps. Thus, this file will be downloaded to a file named `<hashcode of foo's URL>$timestamp_cache`.
    - `addFile(foo)` is called a second time at `timestamp = 2` and the same file is passed to it.
    - A task runs on an executor and discovers that the added file's timestamp (2) is newer than the timestamp of the file that it has already downloaded (1), so it tries to fetch files again:
      - Because the file with the newer timestamp is not present in the fetch file cache, a new copy of the file will be downloaded. **<--- this is the second download I was referring to**
    
    If the fetch file cache is disabled, on the other hand, then we directly call [`doFetchFile`](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L617) which, in turn, will call `downloadFile()`, which [downloads the file to a temporary file](https://github.com/apache/spark/blob/2eedc00b04ef8ca771ff64c4f834c25f835f5f44/core/src/main/scala/org/apache/spark/util/Utils.scala#L499) before considering whether to overwrite an existing file.
    
    In either case, it looks like re-adding a file with a new timestamp will trigger downloads on the executors and those downloads will be unnecessary if the file's contents are unchanged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by josephguan <gi...@git.apache.org>.
Github user josephguan commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Hi @JoshRosen
    
    I'm investigating [SPARK-19417 spark.files.overwrite is ignored](https://issues.apache.org/jira/browse/SPARK-19417). But according to this PR, it seems it is designed to ignore spark.files.overwrite, as we assume files are immutable and can not be overwrite, right?
    
    So, shall we update the document and remove codes relating to 'spark.files.overwrite' property?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72838556
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1430,14 +1430,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           schemeCorrectedPath
         }
         val timestamp = System.currentTimeMillis
    -    addedFiles(key) = timestamp
    -
    -    // Fetch the file locally in case a job is executed using DAGScheduler.runLocally().
    -    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
    -      hadoopConfiguration, timestamp, useCache = false)
    -
    -    logInfo("Added file " + path + " at " + key + " with timestamp " + addedFiles(key))
    -    postEnvironmentUpdate()
    +    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
    --- End diff --
    
    Yeah, this is actually intentional. I experimented with implementing the 1.x behavior and started writing some tests to verify that newer versions of files took precedence over old ones but then discovered that Spark executors will crash with exceptions if they've downloaded a file with a given name and the new file's contents don't match the old file. Given this behavior it seems that updating the timestamp will work only if the new file has the same contents as the old file (in which case it doesn't matter what we do with the timestamp) or if all executors are replaced before running any further tasks (which seems like an obscure use-case that we don't want to optimize for). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Great! I'm going to merge this now. At some point we should clarify the user-facing documentation for `addFile()`, etc., but let's do that in a followup.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by NYcleaner <gi...@git.apache.org>.
Github user NYcleaner commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    hi, The  addFile(path)  function in pyspark only have one param 'path'', so  if  i have many files . I had to use addFile many times, can you  make the function as same  as in the  scala APi 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #63024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63024/consoleFull)** for PR 14396 at commit [`9aa32b3`](https://github.com/apache/spark/commit/9aa32b394cd80c776ac567a662c9471351228b93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63024/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #62994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62994/consoleFull)** for PR 14396 at commit [`0d7dd0d`](https://github.com/apache/spark/commit/0d7dd0d12ace12bf54c3d6b62b802ffa4de800a3).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #63024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63024/consoleFull)** for PR 14396 at commit [`9aa32b3`](https://github.com/apache/spark/commit/9aa32b394cd80c776ac567a662c9471351228b93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    A few other notes:
    
    - It's important that we keep the timestamps because they're necessary for cross-worker / cross-executor JAR + file download caching to work correctly. This is a relatively obscure feature which was added a few releases ago in order to improve scalability when running large clusters with many executors per worker. Theoretically different applications will generally have different timestamps for the same JAR so we could just as well use app ids for this but I'd rather not make such an invasive change now.
    - If we assume that `addFile()` will never be called with files with the same name and different contents, so as not brick executors, then allowing files to be downloaded so that the content-comparison can be performed on executors will lead to performance problems if `addFile` is repeatedly called with the same argument, since executors will repeatedly download the file despite it having identical contents.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72836763
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1430,14 +1430,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           schemeCorrectedPath
         }
         val timestamp = System.currentTimeMillis
    -    addedFiles(key) = timestamp
    -
    -    // Fetch the file locally in case a job is executed using DAGScheduler.runLocally().
    -    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
    -      hadoopConfiguration, timestamp, useCache = false)
    -
    -    logInfo("Added file " + path + " at " + key + " with timestamp " + addedFiles(key))
    -    postEnvironmentUpdate()
    +    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
    --- End diff --
    
    Is `putIfAbsent` correct here? It won't update the timestamp when you call `addFile` for the same file again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72836856
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1721,11 +1711,13 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
             }
           }
           if (key != null) {
    -        addedJars(key) = System.currentTimeMillis
    -        logInfo("Added JAR " + path + " at " + key + " with timestamp " + addedJars(key))
    +        val timestamp = System.currentTimeMillis
    +        if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
    --- End diff --
    
    Same question re: `putIfAbsent`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    cc @vanzin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14396


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72854729
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1430,14 +1430,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           schemeCorrectedPath
         }
         val timestamp = System.currentTimeMillis
    -    addedFiles(key) = timestamp
    -
    -    // Fetch the file locally in case a job is executed using DAGScheduler.runLocally().
    -    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
    --- End diff --
    
    Yep, turns out this is needed in the Python tests so I added it back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not t...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14396#discussion_r72839076
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1430,14 +1430,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           schemeCorrectedPath
         }
         val timestamp = System.currentTimeMillis
    -    addedFiles(key) = timestamp
    -
    -    // Fetch the file locally in case a job is executed using DAGScheduler.runLocally().
    -    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
    --- End diff --
    
    I believe that this line is unnecessary now that `runLocally` was removed a few releases ago. However, I suppose that we might need it in order to handle the somewhat obscure corner-case where a user's `reduce` function / closure executes on the driver and accesses the `SparkFiles` object.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    I'll have to see about the JAR case, but I think that the case of re-uploading a file behaves in an especially-confusing way if the contents are different: the `Utils.fetchFile()` call on the driver would cause the driver's `addFile()` call to throw an exception but this would happen after the `addedFiles(key) = timestamp` assignment, meaning that executors will already have been bricked.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    The change LGTM.
    
    I vaguely remember some people relying on the behavior that the new contents are uploaded when calling these methods a second time. e.g., to upload a new jar with updated code (and yes, I realize that is super sketchy to begin with and probably won't work as they expect). Maybe making the behavior clear in the scaladoc would help?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #62994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62994/consoleFull)** for PR 14396 at commit [`0d7dd0d`](https://github.com/apache/spark/commit/0d7dd0d12ace12bf54c3d6b62b802ffa4de800a3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63083/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    **[Test build #63083 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63083/consoleFull)** for PR 14396 at commit [`9aa32b3`](https://github.com/apache/spark/commit/9aa32b394cd80c776ac567a662c9471351228b93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62994/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by NYcleaner <gi...@git.apache.org>.
Github user NYcleaner commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    Oh. sorry. I am a newer , thank you for your patience and help


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14396: [SPARK-16787] SparkContext.addFile() should not throw if...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/14396
  
    @JoshRosen Thanks for your clarifying. LGTM!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org