Posted to issues@spark.apache.org by "Gaurav Shah (JIRA)" <ji...@apache.org> on 2016/09/16 12:38:20 UTC

[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore

    [ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496234#comment-15496234 ] 

Gaurav Shah commented on SPARK-16121:
-------------------------------------

[~mengxr] Was this fixed in 2.0.0, or is it planned for 2.0.1? My partition discovery takes about 10 minutes, and I expect this fix should address it.

> ListingFileCatalog does not list in parallel anymore
> ----------------------------------------------------
>
>                 Key: SPARK-16121
>                 URL: https://issues.apache.org/jira/browse/SPARK-16121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 2.0.0
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown below. When the number of user-provided paths is less than the value of {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we will not use parallel listing at all, which is different from what 1.6 does (in 1.6, if the number of children of any inner dir is larger than the threshold, we switch to parallel listing for that level).
> {code}
> protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
>     if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>       HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
>     } else {
>       // Dummy jobconf to get to the pathFilter defined in configuration
>       val jobConf = new JobConf(hadoopConf, this.getClass)
>       val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>       val statuses: Seq[FileStatus] = paths.flatMap { path =>
>         val fs = path.getFileSystem(hadoopConf)
>         logInfo(s"Listing $path on driver")
>         Try {
>           HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
>         }.getOrElse(Array.empty[FileStatus])
>       }
>       mutable.LinkedHashSet(statuses: _*)
>     }
>   }
> {code}
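To illustrate the behavioral difference being reported: the 1.6-style listing re-checks the threshold at every directory level while recursing, so a single top-level path with many partition subdirectories still triggers parallel listing. The sketch below is hypothetical (helper names, the threshold value, and the local stand-in for Spark's distributed listLeafFilesInParallel are all assumptions, not Spark's actual code):

{code}
import java.io.File

object ListLeafFilesSketch {
  // Assumed value; Spark reads this from
  // spark.sql.sources.parallelPartitionDiscovery.threshold.
  val threshold = 32

  // Plain serial recursion, used once we have "decided" for a subtree.
  def serialList(dir: File): Seq[File] = {
    val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    val (dirs, files) = children.partition(_.isDirectory)
    files ++ dirs.flatMap(serialList)
  }

  // 1.6-style decision: re-check the threshold at each level of the
  // recursion, not only against the user-provided top-level paths.
  def listLeafFiles(paths: Seq[File]): Seq[File] = {
    if (paths.length >= threshold) {
      // Spark would distribute this step across the cluster via
      // listLeafFilesInParallel; a local call keeps the sketch runnable.
      paths.flatMap(serialList)
    } else {
      paths.flatMap { p =>
        val children = Option(p.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
        val (dirs, files) = children.partition(_.isDirectory)
        files ++ listLeafFiles(dirs) // recurse, re-checking the threshold
      }
    }
  }
}
{code}

The 2.0 code quoted above, by contrast, compares only {{paths.length}} (the user-provided paths) against the threshold once, so a single path with thousands of partition directories is listed entirely on the driver.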



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org