You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/12 21:31:13 UTC

[GitHub] [spark] sunchao opened a new pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

sunchao opened a new pull request #30019:
URL: https://github.com/apache/spark/pull/30019


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This PR removes the restriction in `HadoopFSUtils.parallelListLeafFiles` which only calls `listLocatedStatus` when the `FileSystem` is either `DistributedFileSystem` or `ViewFileSystem`. Instead, this checks locality flag and calls `listLocatedStatus` when that is set.
   
   Consequently, this also removes the special handling when locality info is required but the file system impl is neither above. Currently in this case we'd call `getFileBlockLocations` on each `FileStatus` obtained, and do some handling on missing files as well.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   `HadoopFsUtils.parallelListLeafFiles` currently only calls `listLocatedStatus` when the `FileSystem` impl is `DistributedFileSystem` or `ViewFileSystem`. For other types of `FileSystem`, it calls `listStatus` and then subsequently calls `getFileBlockLocations` on all the result `FileStatus`es. 
   
   In Hadoop client, `listLocatedStatus` is a well-defined API and in fact it is often overridden by specific file system implementations, such as S3A. The default `listLocatedStatus` also has similar behavior as it's done in Spark.
   
   Therefore, instead of re-implement the logic in Spark itself, it's better to rely on the `FileSystem`-specific implementation for `listLocatedStatus`, which could include its own optimizations in the code path.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   There could be some behavior change when `spark.sql.files.ignoreMissingFiles` is set and there are race conditions during listing. 
   
   For instance, assume locality info is required and some files in the listing result _were deleted right after the listing op, but before the subsequent `getFileBlockLocations` calls_ .  in the previous code. The previous implementation would return partial listing result, but the current impl will return an empty set, since the default `FileSystem.listLocatedStatus` will call `getFileBlockLocations` internally which will cause it fail as a whole.
   
   On the other hand, it's also possible that new implementation returns a full list rather than a partial list in the previous impl, since we don't call `getFileBlockLocations` as a separate step now which eliminates some possibility that a file was deleted before the `getFileBlockLocations` as in the previous impl.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Relying on existing tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707357103


   **[Test build #129707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129707/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #30019:
URL: https://github.com/apache/spark/pull/30019#discussion_r508730399



##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)

Review comment:
       sg - I'll switch to `listStatusIterator` and create a wrapper class for the returned `RemoteIterator` in both cases.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712452329


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707424439


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129707/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712452305


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34624/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712420537


   **[Test build #130017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130017/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713221562


   **[Test build #130059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130059/testReport)** for PR 30019 at commit [`3c0ad25`](https://github.com/apache/spark/commit/3c0ad259bea09095da30863af9e614e23cfff0d6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712440883


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34624/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713179916


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34668/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707856299


   I still need to look into failing tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713222113






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-715505865


   @steveloughran that will be great, I'm also exploring to having Spark directly use Hadoop 3.3.0+ in #30135 but it seems there are some bugs from Hadoop side which caused test failures.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707381929






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713222113






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] steveloughran commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

steveloughran commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-716547112


   @sunchao if you do want backporting, probably easiest if you lift it so I can review, though I could probably abuse the "I'm cherry picking a subset of a larger patch" rule to do it to...I'd be pulling back the hadoop-common changes without any of the s3a optimisations. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] steveloughran commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

steveloughran commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-714490460


   >> The new openFile() API will let the caller specify seek policy (sequential, random, adaptive,...) and, if you pass in the known file length, skip the HEAD. It's also an async operation on S3A, even when a HEAD is needed.
   
   > Cool. This is in 3.3.0+ though. We can perhaps explore this once Spark made the switch.
   
   if someone wants to do the work, we could backport the API (with no optimised implementations) to 3.2.x. Same for other public FileSystem API changes. Here work:== backport API plus subset of tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713055884


   Thanks @steveloughran , for putting all the context on S3A/ABFS, and sorry for the late comment.
   
   > HEAD: don't call exists/getFileStatus/etc if you know a file is there. It's also better to do a try { open() } catch (f: FileNotFoundException) than it is for a probe + open, as youl will save a HEAD.
   
   Good point. I guess this applies in general to other FS-impls as well. I can spend some time checking Spark codebase for this pattern and find potential improvements.
   
   > The new openFile() API will let the caller specify seek policy (sequential, random, adaptive,...) and, if you pass in the known file length, skip the HEAD. It's also an async operation on S3A, even when a HEAD is needed.
   
   Cool. This is in 3.3.0+ though. We can perhaps explore this once Spark made the switch.
   
   > LIST ...
   
   Yes I wish Spark can benefit from the paged listing (I know Presto has [optimizations](https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java) around this and it seems to work really well for them) from `FileSystem` impls but it will need some significant changes. 
   
   > What you can do right now is add an option to log the toString() values of input streams/output streams/remoteiterators at debug level to some performance-only log
   
   Sounds good. Let me try to add these.
   
   > I'd love to get some comparisons here.
   
   I'll see what I can do ... without the incremental listing support I think we won't see much difference here? plus I don't have a testing environment for S3A at hand :(
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707423910


   **[Test build #129707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129707/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713189478






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707375650


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34313/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707381920


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34313/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707357103


   **[Test build #129707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129707/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707381929






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #30019:
URL: https://github.com/apache/spark/pull/30019#discussion_r508729593



##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {
+          def next(): LocatedFileStatus = remoteIter.next
+          def hasNext(): Boolean = remoteIter.hasNext
+        }.toArray

Review comment:
       Yes it would be lovely if we can get async listing here, but I think it requires a much bigger surgery - up to the top level currently Spark's RDD model requires all the input partitions to be ready before it can start processing (deeply embedded in its primitives such as map/reduce).
   
   We can perhaps add the async logic here in this class but I think "local" processing we're doing here is far cheaper than the remote listing and perhaps can't gain much from the change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] steveloughran commented on a change in pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

steveloughran commented on a change in pull request #30019:
URL: https://github.com/apache/spark/pull/30019#discussion_r504525199



##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {

Review comment:
       Might be good to pull this out into something reusable for any `RemoteIterator[T]` as it gets used in a number of API calls (all because of java's checked exceptions...)

##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)

Review comment:
       switch to listStatusIterator(path) and again, provide a remoteIterator. This will give you on paged downloads on hdfs, webhdfs, async page prefetch on latest S3A builds, and, at worst elsewhere, exactly the same performance a listStatus

##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {
+          def next(): LocatedFileStatus = remoteIter.next
+          def hasNext(): Boolean = remoteIter.hasNext
+        }.toArray

Review comment:
       the longer you can incrementally do per entry in the remote iterator, the more latencies talking to the object stores can be hidden. See HADOOP-17074 and HADOOP-17023 for details; one of the PRs shows some numbers there. 
   
   If the spark API could return an iterator/yield and the processing of it used that, a lot of that listing cost could be absorbed entirely.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712452329






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712480306


   **[Test build #130017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130017/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] xkrogen commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

xkrogen commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707866922


   +1 on this effort from me @sunchao ! Checking FS instance types is brittle and has caused us lots of headaches when we leverage wrapper-type FS instances. Allowing the `FileSystem` instance to understand its own capabilities and delegate appropriately is much better.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713163024


   **[Test build #130059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130059/testReport)** for PR 30019 at commit [`3c0ad25`](https://github.com/apache/spark/commit/3c0ad259bea09095da30863af9e614e23cfff0d6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707424432


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #30019:
URL: https://github.com/apache/spark/pull/30019#discussion_r508729593



##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {
+          def next(): LocatedFileStatus = remoteIter.next
+          def hasNext(): Boolean = remoteIter.hasNext
+        }.toArray

Review comment:
       Yes it would be lovely if we can get async listing here, but I think it requires a much bigger surgery - up to the top level currently Spark's RDD model requires all the input partitions to be ready before it can start processing (deeply embedded in its primitives such as map/reduce).
   
   We can perhaps add the async logic here in this class but I think "local" processing we're doing here is far cheaper than the remote listing and perhaps can't gain much from the change.
   
   We can wrap the iterator and make it looks like a lazy array until certain info is needed but again I think it won't go very far until we make extensive changes in upper stack like in `PartitioningAwareFileIndex` or `DataSourceScanExec`. Anyways I'll perhaps try this in a separate PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707421105


   Will do @dongjoon-hyun . Should I open a new PR or change title for this one?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #30019:
URL: https://github.com/apache/spark/pull/30019


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713163024


   **[Test build #130059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130059/testReport)** for PR 30019 at commit [`3c0ad25`](https://github.com/apache/spark/commit/3c0ad259bea09095da30863af9e614e23cfff0d6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713189444


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34668/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #30019:
URL: https://github.com/apache/spark/pull/30019


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-713189469






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712480911


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130017/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-32381][CORE][SQL][FOLLOWUP] Reply on listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707424432






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712452337


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34624/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712420537


   **[Test build #130017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130017/testReport)** for PR 30019 at commit [`651f6a6`](https://github.com/apache/spark/commit/651f6a625363e141b5e9b63d33984a4aaab2a575).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712480904






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-772936908


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-710670129


   cc @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712418642


   Retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-707856091


   @dongjoon-hyun created [SPARK-33135](https://issues.apache.org/jira/browse/SPARK-33135) and changed this PR's title.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #30019:
URL: https://github.com/apache/spark/pull/30019#issuecomment-712480904


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #30019: [SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #30019:
URL: https://github.com/apache/spark/pull/30019#discussion_r508729593



##########
File path: core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala
##########
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {
+          def next(): LocatedFileStatus = remoteIter.next
+          def hasNext(): Boolean = remoteIter.hasNext
+        }.toArray

Review comment:
       Yes it would be lovely if we can get async listing here, but I think it requires a much bigger surgery - up to the top level currently Spark's RDD model requires all the input partitions to be ready before it can start processing (deeply embedded in its primitives such as map/reduce).
   
   We can perhaps add the async logic here in this class but I think "local" processing we're doing here is far cheaper than the remote listing and perhaps can't gain much from the change.
   
   We can wrap the iterator and make it looks like an array until certain info is needed but again I think it won't go very far until we make extensive changes in upper stack like in `PartitioningAwareFileIndex` or `DataSourceScanExec`. Anyways I'll perhaps try this in a separate PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org