Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/23 22:06:19 UTC

[GitHub] [spark] holdenk commented on pull request #29179: [WIP][SPARK-32381][CORE][SQL] Explore allowing parallel listing & non-location sensitive listing in core

holdenk commented on pull request #29179:
URL: https://github.com/apache/spark/pull/29179#issuecomment-663255596


   > There's potential here, I'm curious about the numbers
   > 
   > * try to do incremental result generation through remote iterators, yield, etc. That way, the async fetch of the next page of results from the store while the app is going through its first page (https://issues.apache.org/jira/browse/HADOOP-17074) can deliver tangible benefits. You also avoid the out-of-memory problems that come from representing directories with a few million files as arrays with a few million FileStatus entries.
   > * use listLocatedStatus everywhere and I'll see about getting the azure developers to speed up the abfs implementation
   So I also want to support querying HDFS where we know it's disaggregated.
   > * in [apache/hadoop#2069](https://github.com/apache/hadoop/pull/2069) the S3a remoteiterators implement the new IOStatisticsAPI; LocatedFileStatusFetcher will collect and aggregate the results. If you use the iterator APIs it should be possible to do the same thing
   Interesting. Is this specific to the S3A impl or is there a higher base class? I want to make it work with multiple file formats if possible.
   > * even without that, if you can collect/report listing times that could be useful.
   So we'll semi-implicitly have it from the job statistics, but not in a very easy-to-access form. I could use an accumulator to keep track of it, which would also work with multi-worker fan-out.
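
To make the incremental-listing suggestion from the first bullet concrete, here is a minimal sketch in plain Python. It is not the Hadoop `RemoteIterator` API or the code in this PR. `fetch_page`, its token format, and `make_fake_store` are hypothetical stand-ins for a paged object-store listing. The generator yields entries from the current page while the next page is fetched on a background thread, so at most two pages are held in memory at once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page-fetch contract: token -> (entries, next_token or None).
Page = Tuple[List[str], Optional[str]]


def list_incrementally(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield entries one by one while the next page is fetched in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_page, None)
        while pending is not None:
            entries, next_token = pending.result()
            # Kick off the next fetch before the caller consumes this page.
            if next_token is not None:
                pending = pool.submit(fetch_page, next_token)
            else:
                pending = None
            yield from entries


def make_fake_store(names: List[str], page_size: int) -> Callable[[Optional[str]], Page]:
    """In-memory stand-in for a paged store listing (illustration only)."""
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(names) else None
        return names[start:end], next_token
    return fetch_page
```

A caller would iterate `list_incrementally(make_fake_store(names, 1000))` and process each entry as it arrives, rather than materializing the whole listing as one array.
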
   > 
   > Finally, is the idea here to actually push the scan out across the cluster, or just to do it multithreaded in the Spark driver process?
   
   The idea here is to push it out to the workers (in part because of per-host rate limiting), but also to match the code we have on the SQL side so we have less maintenance cost.
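
As a rough illustration of fanning out listing under a per-host limit while recording listing times, here is a sketch that uses local threads as a stand-in for Spark workers. It is not the implementation in this PR. `list_dir`, `max_per_host`, and `max_workers` are hypothetical names, and the "host" is simply taken from the authority part of each root URI.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import urlparse


def list_in_parallel(
    roots: Iterable[str],
    list_dir: Callable[[str], List[str]],  # hypothetical single-directory lister
    max_per_host: int = 2,
    max_workers: int = 8,
) -> Tuple[Dict[str, List[str]], Dict[str, float]]:
    """List every root concurrently, capping in-flight calls per host.

    Returns (entries per root, total listing seconds per host).
    """
    roots = list(roots)
    # One semaphore per host, so no host sees more than max_per_host calls at once.
    limits = {urlparse(r).netloc: threading.Semaphore(max_per_host) for r in roots}
    seconds_by_host: Dict[str, float] = defaultdict(float)
    stats_lock = threading.Lock()

    def timed_list(root: str) -> Tuple[str, List[str]]:
        host = urlparse(root).netloc
        with limits[host]:
            start = time.perf_counter()
            entries = list_dir(root)
            elapsed = time.perf_counter() - start
        # Aggregate timings under a lock, similar in spirit to an accumulator.
        with stats_lock:
            seconds_by_host[host] += elapsed
        return root, entries

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = dict(pool.map(timed_list, roots))
    return results, dict(seconds_by_host)
```

In a real cluster the per-host cap would come from how tasks are spread across executors rather than from in-process semaphores; the sketch only shows the shape of the bookkeeping.
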

