Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/22 19:09:54 UTC

[GitHub] [spark] rrusso2007 removed a comment on issue #24679: [SPARK-27807][SQL] Parallel resolve leaf statuses InMemoryFileIndex

rrusso2007 removed a comment on issue #24679: [SPARK-27807][SQL] Parallel resolve leaf statuses InMemoryFileIndex
URL: https://github.com/apache/spark/pull/24679#issuecomment-494906133

In my pull request #24672 I am switching to a method that returns LocatedFileStatus, which makes these lookups unnecessary. For HDFS specifically, the optimization in this pull request would then not be needed. If we still want an optimization like this one for other file systems, maybe we should filter out the LocatedFileStatus instances first instead of detecting them inside the parallel collection. If they are all located already, there is no need to create a parallel collection to resolve them.
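
A rough sketch of what I mean by filtering first, not the code in either pull request. The name `resolveLeafStatuses` is hypothetical, the input is assumed to contain only leaf files, and `.par` assumes Scala 2.12, where parallel collections ship with the standard library. The Hadoop calls used are `FileSystem.getFileBlockLocations` and the `LocatedFileStatus(FileStatus, BlockLocation[])` constructor.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, LocatedFileStatus}

// Hypothetical helper for illustration only; assumes every status is a leaf file.
def resolveLeafStatuses(
    fs: FileSystem,
    statuses: Seq[FileStatus]): Seq[LocatedFileStatus] = {
  // Statuses that already carry block locations need no further lookup.
  val alreadyLocated = statuses.collect { case l: LocatedFileStatus => l }
  val unresolved = statuses.filterNot(_.isInstanceOf[LocatedFileStatus])

  if (unresolved.isEmpty) {
    // Everything was located up front, so no parallel collection is created.
    alreadyLocated
  } else {
    // Only the remaining statuses pay for a block-location lookup,
    // done here on driver threads via a parallel collection (Scala 2.12).
    val resolved = unresolved.par.map { f =>
      new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
    }.toVector
    alreadyLocated ++ resolved
  }
}
```

Note that this version returns the already located statuses first, so it does not preserve the input order. If callers depend on the order, the results would need to be merged back by path.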

On another note, when resolving multiple paths in parallel, the existing code uses a Spark job rather than a parallel collection. Using a parallel collection here risks unexpectedly launching many parallel threads in the driver, as opposed to offloading the work to the executors, which have allocated cores.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org