
Posted to issues@spark.apache.org by "Aniket Bhatnagar (JIRA)" <ji...@apache.org> on 2016/11/04 21:22:59 UTC

[jira] [Commented] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed

    [ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637731#comment-15637731 ] 

Aniket Bhatnagar commented on SPARK-18273:
------------------------------------------

Thanks [~srowen]. I didn't realize that I could pass a glob pattern. Thank you so much.

> DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18273
>                 URL: https://issues.apache.org/jira/browse/SPARK-18273
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aniket Bhatnagar
>            Priority: Minor
>
> If the paths Seq parameter contains many elements, DataFrameReader.load takes a long time to start the job because it checks whether each path exists using fs.exists. There should be a boolean configuration option to disable the path existence check, and it should be passed as a parameter to the DataSource.resolveRelation call.
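
The glob workaround mentioned in the comment could be sketched roughly as follows. This is only an illustration, not code from Spark or from the issue: the helper name collapse_to_globs and the example bucket paths are hypothetical, and it assumes that the sibling paths share a parent directory and that their names contain no commas or glob metacharacters. It relies on the {a,b,c} alternation form of Hadoop glob patterns to turn many explicit paths into a few patterns.

```python
import posixpath
from collections import defaultdict


def collapse_to_globs(paths):
    """Group paths by parent directory and emit one pattern per parent.

    Hypothetical helper: siblings are merged with `{a,b,c}` alternation.
    Names are not escaped, so commas or glob metacharacters in a name
    would produce a wrong pattern.
    """
    by_parent = defaultdict(list)
    for path in paths:
        # Split "scheme://bucket/dir/name" into its parent and last component.
        parent, name = posixpath.split(path.rstrip("/"))
        by_parent[parent].append(name)

    patterns = []
    for parent, names in by_parent.items():
        if len(names) == 1:
            # A single child needs no alternation; keep the path as it was.
            patterns.append(posixpath.join(parent, names[0]))
        else:
            patterns.append(posixpath.join(parent, "{" + ",".join(names) + "}"))
    return patterns


# Example paths are made up for illustration.
example_paths = [
    "s3://bucket/events/2016-11-01",
    "s3://bucket/events/2016-11-02",
    "s3://bucket/other/part",
]
example_patterns = collapse_to_globs(example_paths)

# With PySpark available, the patterns could then be handed to the reader,
# for example: spark.read.load(example_patterns, format="parquet")
```

Whether this shortens job startup in a given setup depends on how the data source resolves globs, so it is worth measuring before relying on it.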


