Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/22 03:43:46 UTC

[GitHub] [spark] WangGuangxin opened a new pull request #24175: [SPARK-27232][SQL]Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to zero

URL: https://github.com/apache/spark/pull/24175
 
 
   ## What changes were proposed in this pull request?
   
   `InMemoryFileIndex` requests file block location information so that `TaskSetManager` can do locality-aware scheduling.
   
   This is usually time-consuming. For example, in our production environment there are 24 partitions containing 149925 files in total, 83 TB in size. It takes about 10 minutes to request file block locations before a Spark job is submitted. Even when I set `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to parallelize the listing, it still takes 2 minutes.
   
   This time is wasted if we don't care about file locality (for example, when storage and computation are separate).
   
   So there should be a conf to control whether we send `getFileBlockLocations` requests to the HDFS NameNode (NN). If the user sets `spark.locality.wait` to 0, file block location information is meaningless.
   
   In this PR, if `spark.locality.wait` is set to 0, `InMemoryFileIndex` no longer requests file location information, which saves anywhere from several seconds to minutes.
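
The gating idea described above can be illustrated with a small, self-contained model. This is only a sketch and not the code in this PR: it is written in Python for brevity, and `locality_wait_ms`, `FileEntry`, `list_files`, and `fetch_block_hosts` are hypothetical names standing in for Spark's duration parsing, its file status objects, the listing step, and the `getFileBlockLocations` call. The duration parser is simplified and assumes a bare number means milliseconds.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# Assumption: a simplified subset of Spark-style duration suffixes.
# A bare number such as "0" is treated as milliseconds.
_UNIT_MS = {"": 1, "ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def locality_wait_ms(value: str) -> int:
    """Convert a duration string such as "0", "0s" or "3s" to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None or match.group(2) not in _UNIT_MS:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


@dataclass
class FileEntry:
    """Hypothetical stand-in for a listed file and its block hosts."""
    path: str
    size: int
    block_hosts: List[str] = field(default_factory=list)


def list_files(
    statuses: Iterable[Tuple[str, int]],
    fetch_block_hosts: Callable[[str], List[str]],
    locality_wait: str = "3s",
) -> List[FileEntry]:
    """List files, asking for block hosts only when locality can matter."""
    need_locality = locality_wait_ms(locality_wait) > 0
    entries = []
    for path, size in statuses:
        # The expensive per-file lookup is skipped when the wait is zero.
        hosts = fetch_block_hosts(path) if need_locality else []
        entries.append(FileEntry(path, size, hosts))
    return entries


# Usage: count how many lookups a fake NameNode receives.
calls: List[str] = []


def fake_namenode_lookup(path: str) -> List[str]:
    calls.append(path)
    return ["host-a", "host-b"]


files = [("/data/part-0", 128), ("/data/part-1", 256)]
with_locality = list_files(files, fake_namenode_lookup, locality_wait="3s")
without_locality = list_files(files, fake_namenode_lookup, locality_wait="0")
print(len(calls))  # only the first listing triggered lookups
```

In this model the second listing returns the same files with empty host lists, which corresponds to scheduling tasks without locality preferences.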
   
   ## How was this patch tested?
   
   Tested manually.
   
