Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:39 UTC

[jira] [Updated] (SPARK-8132) Race condition if task is cancelled with interruption while fetching file dependencies

     [ https://issues.apache.org/jira/browse/SPARK-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-8132:
--------------------------------
    Labels: bulk-closed  (was: )

> Race condition if task is cancelled with interruption while fetching file dependencies
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-8132
>                 URL: https://issues.apache.org/jira/browse/SPARK-8132
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: Josh Rosen
>            Priority: Major
>              Labels: bulk-closed
>
> This is a borderline impossible-to-reproduce bug:
> If {{spark.files.overwrite = false}} (the default) and a Spark executor is fetching large file dependencies from the driver _and_ the first task that triggered file dependency loading is cancelled after it has started copying / moving the downloaded file to its target directory, then the executor may be put into a bad state where all subsequent tasks fail with errors about refusing to overwrite an existing file because its contents differ from the file being fetched.
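> A minimal sketch of the pattern involved, assuming a heavily simplified stand-in for the executor's fetch logic (the helper below is illustrative and is not Spark's actual {{Utils.fetchFile}} code):
> {code:scala}
> import java.io.File
> import java.nio.file.{Files, StandardCopyOption}
>
> // Simplified model of the executor-side fetch: the dependency has already been
> // downloaded to tempFile; it now has to be copied into the shared target directory.
> def moveToTarget(tempFile: File, targetFile: File, overwrite: Boolean): Unit = {
>   if (targetFile.exists() && !overwrite) {
>     // With spark.files.overwrite = false, an existing target whose contents differ
>     // from the freshly fetched file is treated as an error.
>     val same = java.util.Arrays.equals(
>       Files.readAllBytes(tempFile.toPath), Files.readAllBytes(targetFile.toPath))
>     if (!same) {
>       throw new IllegalStateException(
>         s"File $targetFile exists and does not match contents of $tempFile")
>     }
>   } else {
>     // If the task is cancelled with interruption while this copy is in flight, a
>     // truncated targetFile can be left behind; every later task then hits the
>     // contents-differ check above and fails.
>     Files.copy(tempFile.toPath, targetFile.toPath, StandardCopyOption.REPLACE_EXISTING)
>   }
> }
> {code}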
> There are a few ways to mitigate this:
> - Set {{spark.files.overwrite = true}} (see the configuration sketch after this list).  We should probably remove or deprecate this configuration: the only reason that it was added was to work around an obscure Spark 0.8-era bug where Spark would delete files out of the driver's CWD when running tasks in local mode.  This concern may have been mitigated by other changes.  Regardless, there are many environments where this feature can safely be disabled.
> - Disable {{spark.files.useFetchCache}}, which should probably be off by default (see SPARK-8130); this will shorten the window over which the race can occur.
> - Catch {{InterruptedException}} and perform cleanup in our file moving / copying code; this is somewhat tricky to reason about / get right because the correct behavior differs depending on whether we're overwriting an existing file or creating a new one (a rough cleanup sketch follows the configuration example below).
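> As a concrete example of the two configuration-based mitigations above (a sketch only; whether these settings are appropriate depends on the deployment):
> {code:scala}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .setAppName("fetch-race-workaround")
>   // Allow a fetched dependency to replace an existing copy instead of failing
>   // when the contents differ.
>   .set("spark.files.overwrite", "true")
>   // Shrink the window for the race by disabling the executor fetch cache
>   // (see SPARK-8130).
>   .set("spark.files.useFetchCache", "false")
> {code}
> The same settings can also be passed to {{spark-submit}}, e.g. {{--conf spark.files.overwrite=true --conf spark.files.useFetchCache=false}}.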
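> A rough sketch of the third mitigation for the create-a-new-file case (illustrative only, not the actual Spark code; overwriting an existing file would need a different strategy, e.g. copying to a sibling temp file and renaming it into place):
> {code:scala}
> import java.io.File
> import java.nio.file.{Files, StandardCopyOption}
>
> // Copy into the target and, if the copy is aborted by an interrupt (or any other
> // failure), delete the partial target so that later fetches do not see a
> // corrupted file.
> def copyWithCleanup(tempFile: File, targetFile: File): Unit = {
>   try {
>     Files.copy(tempFile.toPath, targetFile.toPath, StandardCopyOption.REPLACE_EXISTING)
>   } catch {
>     case t: Throwable =>
>       Files.deleteIfExists(targetFile.toPath)
>       throw t
>   }
> }
> {code}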
> Given that the cases I've seen can be worked around with configuration changes, I'm not sure that this needs to be a high-priority fix, although I would be glad to review patches to clean up / audit this code and properly fix the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org