Posted to issues@spark.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2016/07/26 13:19:20 UTC

[jira] [Commented] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

    [ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393770#comment-15393770 ] 

Steve Loughran commented on SPARK-16736:
----------------------------------------

See also HIVE-14323

> remove redundant FileSystem status checks calls from Spark codebase
> -------------------------------------------------------------------
>
>                 Key: SPARK-16736
>                 URL: https://issues.apache.org/jira/browse/SPARK-16736
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are wrappers around {{FileSystem.getFileStatus()}}, which puts load on an HDFS NameNode and is very, very slow against object stores.
> # If these calls are followed by any {{getFileStatus()}} call, they can be eliminated by merging the calls and moving the catching of {{FileNotFoundException}} out of the {{exists()}} call and into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists check and relying on {{FileSystem.delete()}} to be a no-op if the destination path is not present. That is a tested requirement of all Hadoop-compatible filesystems and object stores.
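
The two points in the description can be illustrated with a small, self-contained sketch. It does not use the Hadoop API: CountingFs is a hypothetical in-memory stand-in that only mimics the behaviour described above (a status call that throws FileNotFoundException for a missing path, an exists() that wraps it, and a delete() that is a no-op on a missing path). The helper names are likewise made up for illustration. The counter stands in for the NameNode RPCs or object-store requests that each status call would cost.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

public class StatusCheckSketch {

    /** Hypothetical in-memory filesystem that counts status calls. Not the Hadoop API. */
    static class CountingFs {
        private final Map<String, Long> files = new HashMap<>(); // path -> length
        int statusCalls = 0;

        void create(String path, long length) { files.put(path, length); }

        // Assumed contract: throws FileNotFoundException if the path is absent.
        long getFileStatus(String path) throws FileNotFoundException {
            statusCalls++;
            Long length = files.get(path);
            if (length == null) throw new FileNotFoundException(path);
            return length;
        }

        // exists() is only a wrapper around getFileStatus().
        boolean exists(String path) {
            try { getFileStatus(path); return true; }
            catch (FileNotFoundException e) { return false; }
        }

        // Deleting a missing path is a no-op that returns false.
        boolean delete(String path) { return files.remove(path) != null; }
    }

    // Before: exists() followed by getFileStatus() costs two status calls when the file is present.
    static long lengthWithExistsCheck(CountingFs fs, String path) {
        if (!fs.exists(path)) return -1;
        try { return fs.getFileStatus(path); }
        catch (FileNotFoundException e) { return -1; } // path vanished between the two calls
    }

    // After: one status call; the FileNotFoundException is caught by the caller instead.
    static long lengthMerged(CountingFs fs, String path) {
        try { return fs.getFileStatus(path); }
        catch (FileNotFoundException e) { return -1; }
    }

    // After: no exists() probe before delete(); a missing path is simply not an error.
    static boolean deleteWithoutProbe(CountingFs fs, String path) {
        return fs.delete(path);
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.create("/tmp/a", 42L);

        lengthWithExistsCheck(fs, "/tmp/a");
        System.out.println("exists + status: " + fs.statusCalls + " status calls");

        fs.statusCalls = 0;
        lengthMerged(fs, "/tmp/a");
        System.out.println("merged: " + fs.statusCalls + " status call");

        System.out.println("delete of missing path: " + deleteWithoutProbe(fs, "/tmp/missing"));
    }
}
```

In the merged form the missing-file case is handled where the status is actually needed, which also closes the window in which the path could disappear between the exists() probe and the status call.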



