Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/01 15:03:50 UTC

[GitHub] [spark] srowen commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on PathFilter before listing them

srowen commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#issuecomment-478616785
 
 
   Hm, I feel like I am still missing something about the implementation. @adrian-ionescu could I ask you to look at the logic here? I think you implemented a lot of the code in question.
   
   There is some value in filtering out dirs as listLeafFiles / bulkListLeafFiles recurse through the tree, because a filter might be written to match intermediate directories and can prune them. I'm worried, though, that a filter like `.endsWith(".tmp")`, meant for leaf files, might match dirs instead. But otherwise, is this a good optimization?
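   
   To make the worry concrete, here is a minimal sketch, not the Spark code. It assumes the user's filter rejects names ending in ".tmp", and it uses a hypothetical in-memory tree and a plain `Predicate<String>` as stand-ins for the file system and for Hadoop's PathFilter. The `filterDirs` flag is likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class DirFilterSketch {
    // Hypothetical tree: directory path -> children. Paths absent from the map are leaf files.
    static final Map<String, List<String>> TREE = new HashMap<>();
    static {
        TREE.put("/data", Arrays.asList("/data/part-0.parquet", "/data/batch.tmp"));
        // A directory whose name happens to end in ".tmp".
        TREE.put("/data/batch.tmp", Arrays.asList("/data/batch.tmp/part-1.parquet"));
    }

    // Recursively collects leaf files. When filterDirs is true, the same filter
    // is also applied to directories before descending into them.
    static List<String> listLeafFiles(String dir, Predicate<String> filter, boolean filterDirs) {
        List<String> out = new ArrayList<>();
        for (String child : TREE.get(dir)) {
            if (TREE.containsKey(child)) {
                if (!filterDirs || filter.test(child)) {
                    out.addAll(listLeafFiles(child, filter, filterDirs));
                }
            } else if (filter.test(child)) {
                out.add(child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed user intent: skip temporary leaf files.
        Predicate<String> noTmp = p -> !p.endsWith(".tmp");
        System.out.println(listLeafFiles("/data", noTmp, false));
        System.out.println(listLeafFiles("/data", noTmp, true));
    }
}
```

   With `filterDirs` set to false, both parquet files are returned. With it set to true, the directory `/data/batch.tmp` is pruned and the file beneath it is silently dropped, even though the filter was only meant for leaf files.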
   
   While a user can filter out top-level dirs they don't actually want to examine, I think there's a decent point here about more complex filters on nested intermediate dirs.
   
   If we're worried about patterns intended for leaf file paths matching intermediate directories, maybe it's possible to filter after listing dirs, but check the filter against the dir path with "/" appended if it doesn't already end in "/".
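   
   A rough sketch of that trailing-slash idea, under stated assumptions: paths are plain strings, the filter is a `Predicate<String>` standing in for a PathFilter, and the helper names `asDirPath` and `acceptDir` are hypothetical. A real PathFilter receives a Path object, so how the trailing "/" would be carried through is left open here.

```java
import java.util.function.Predicate;

public class TrailingSlashSketch {
    // Appends "/" to a directory path unless it already ends in one.
    static String asDirPath(String dirPath) {
        return dirPath.endsWith("/") ? dirPath : dirPath + "/";
    }

    // Evaluates the user's filter against the slash-terminated form of a directory path.
    static boolean acceptDir(Predicate<String> filter, String dirPath) {
        return filter.test(asDirPath(dirPath));
    }

    public static void main(String[] args) {
        // Assumed user filter: reject anything whose path ends in ".tmp".
        Predicate<String> noTmp = p -> !p.endsWith(".tmp");

        // Leaf file: checked as-is, so it is rejected.
        System.out.println(noTmp.test("/data/part-2.tmp"));
        // Directory: checked as "/data/batch.tmp/", so the suffix pattern no longer matches.
        System.out.println(acceptDir(noTmp, "/data/batch.tmp"));
    }
}
```

   Suffix patterns written for leaf files then stop matching directories by accident, while a filter that really wants to prune a directory can still match on the slash-terminated form.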
