Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2016/08/17 18:44:20 UTC

[jira] [Resolved] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

     [ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-16736.
------------------------------------
       Resolution: Fixed
         Assignee: Steve Loughran
    Fix Version/s: 2.1.0

> remove redundant FileSystem status checks calls from Spark codebase
> -------------------------------------------------------------------
>
>                 Key: SPARK-16736
>                 URL: https://issues.apache.org/jira/browse/SPARK-16736
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 2.1.0
>
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are wrappers around {{FileSystem.getFileStatus()}}, which puts load on an HDFS NameNode and is very slow against object stores.
> # If these calls are followed by any {{getFileStatus()}} calls, they can be eliminated by merging the two and moving the catch of {{FileNotFoundException}} out of {{exists()}} and into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists check and relying on {{FileSystem.delete()}} to be a no-op if the destination path is not present. That is a tested requirement of all Hadoop-compatible filesystems and object stores.
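
The two points in the description can be sketched with a small, self-contained model. This is only an illustration, not the Spark patch: {{CountingFs}} is a hypothetical in-memory stand-in whose method names mirror Hadoop's {{getFileStatus()}}, {{exists()}} and {{delete()}}, and its {{statusCalls}} counter is an assumption used to represent one NameNode RPC or object-store request per status lookup.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model; it does not use the real Hadoop FileSystem API. */
final class StatusCheckDemo {

    /** In-memory stand-in that counts status lookups (path -> file length). */
    static final class CountingFs {
        private final Map<String, Long> lengths = new HashMap<>();
        int statusCalls = 0; // each increment models one remote status request

        void put(String path, long length) { lengths.put(path, length); }

        // Models getFileStatus(): throws when the path is absent.
        long getFileStatus(String path) throws FileNotFoundException {
            statusCalls++;
            Long len = lengths.get(path);
            if (len == null) throw new FileNotFoundException(path);
            return len;
        }

        // Models exists(): a wrapper that swallows the exception.
        boolean exists(String path) {
            try { getFileStatus(path); return true; }
            catch (FileNotFoundException e) { return false; }
        }

        // Models delete(): returns false, without failing, if nothing is there.
        boolean delete(String path) { return lengths.remove(path) != null; }
    }

    // Before: exists() followed by getFileStatus() performs two lookups.
    static long lengthWithExists(CountingFs fs, String path) {
        try {
            return fs.exists(path) ? fs.getFileStatus(path) : -1L;
        } catch (FileNotFoundException e) {
            return -1L; // the path could vanish between the two calls
        }
    }

    // After: a single lookup, with the exception handled by the caller.
    static long lengthMerged(CountingFs fs, String path) {
        try { return fs.getFileStatus(path); }
        catch (FileNotFoundException e) { return -1L; }
    }

    // Before: exists() + delete() performs a status lookup first.
    static void deleteWithExists(CountingFs fs, String path) {
        if (fs.exists(path)) fs.delete(path);
    }

    // After: delete() alone, relying on it being a no-op for a missing path.
    static void deleteDirect(CountingFs fs, String path) { fs.delete(path); }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data/part-0", 42L);

        lengthWithExists(fs, "/data/part-0");
        System.out.println("exists + getFileStatus lookups: " + fs.statusCalls);

        fs.statusCalls = 0;
        lengthMerged(fs, "/data/part-0");
        System.out.println("merged lookups: " + fs.statusCalls);
    }
}
```

In this model the merged variants return the same results while issuing fewer status lookups; the actual cost of each call in a real filesystem or object-store connector depends on its implementation.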



