
Posted to reviews@spark.apache.org by habren <gi...@git.apache.org> on 2018/08/09 02:03:02 UTC

[GitHub] spark pull request #22018: [SPARK-25038][SQL] Get block location in parallel

Github user habren commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22018#discussion_r208788059
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala ---
    @@ -297,7 +297,7 @@ object InMemoryFileIndex extends Logging {
         val missingFiles = mutable.ArrayBuffer.empty[String]
         val filteredLeafStatuses = allLeafStatuses.filterNot(
           status => shouldFilterOut(status.getPath.getName))
    -    val resolvedLeafStatuses = filteredLeafStatuses.flatMap {
    +    val resolvedLeafStatuses = filteredLeafStatuses.par.flatMap {
    --- End diff --
    
    Thanks @maropu for your comments. I updated the title and description. Let me explain the difference between this change and the current parallel partition discovery. The current one discovers different partitions in parallel. This change gets the block locations for the files within a single partition in parallel. When there are only a few partitions and each contains tens of thousands of files, the current partition discovery won't help, and this change can speed up that case.
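
A minimal sketch of the per-file parallelism described above, not the actual Spark code. `FileEntry`, `LocatedEntry`, `lookupHosts`, and `BlockLocationSketch` are hypothetical stand-ins for Hadoop's file status types and the block-location lookup, and the sleep only simulates a remote call. It assumes Scala 2.11 or 2.12, where `.par` is part of the standard library; on Scala 2.13 and later `.par` needs the separate scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`.

```scala
// Hypothetical stand-ins for a file status and a file status with block hosts.
case class FileEntry(path: String, length: Long)
case class LocatedEntry(path: String, length: Long, hosts: Seq[String])

object BlockLocationSketch {
  // Hypothetical lookup: simulates one remote round trip per file.
  // A negative length models a file that vanished before the lookup.
  def lookupHosts(entry: FileEntry): Option[Seq[String]] = {
    Thread.sleep(5) // simulated RPC latency
    if (entry.length < 0) None
    else Some(Seq("host-" + (entry.path.hashCode & 3)))
  }

  // One lookup at a time: total time grows with the number of files.
  def resolveSequential(files: Seq[FileEntry]): Seq[LocatedEntry] =
    files.flatMap { f =>
      lookupHosts(f).map(h => LocatedEntry(f.path, f.length, h)).toList
    }

  // Same logic on a parallel collection: lookups for the files of a
  // single directory overlap. Parallel sequences keep element order.
  def resolveParallel(files: Seq[FileEntry]): Seq[LocatedEntry] =
    files.par.flatMap { f =>
      lookupHosts(f).map(h => LocatedEntry(f.path, f.length, h)).toList
    }.toList
}
```

The sketch returns an `Option` for each file instead of appending to a shared buffer, because a `mutable.ArrayBuffer` is not thread-safe; any state mutated inside a `.par.flatMap` body would need the same care.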


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org