Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/24 01:31:47 UTC

[GitHub] [spark] JoshRosen edited a comment on issue #24668: [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles

URL: https://github.com/apache/spark/pull/24668#issuecomment-495440700
 
 
   On further reflection, it's not _necessarily_ safe to ignore deletions at the root level, because that still leaves us vulnerable to certain races. For example, if we call `globStatus` on the driver to list the first level, pass those paths to `InMemoryFileIndex`, and one of the paths is deleted before `InMemoryFileIndex` begins its listing, then we might miss data.
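
The race described above can be illustrated without Spark. This is only a rough sketch on a local temporary directory: `list_leaf_files` and `demo_race` are hypothetical helpers, not Spark or Hadoop APIs, and `Path.glob` merely stands in for `globStatus`.

```python
import shutil
import tempfile
from pathlib import Path


def list_leaf_files(root_paths, ignore_missing):
    """List files under each root; optionally skip roots that no longer exist."""
    files = []
    for root in root_paths:
        if not root.exists():
            if ignore_missing:
                continue  # whatever lived under this root is dropped silently
            raise FileNotFoundError(str(root))
        files.extend(sorted(p for p in root.iterdir() if p.is_file()))
    return files


def demo_race():
    base = Path(tempfile.mkdtemp())
    try:
        for part in ("part=1", "part=2"):
            (base / part).mkdir()
            (base / part / "data.txt").write_text("row")

        # Step 1: the "driver" expands the glob to list the first level.
        first_level = sorted(base.glob("part=*"))

        # Step 2: one path is deleted before the index begins its listing.
        shutil.rmtree(base / "part=2")

        # Step 3: a listing that ignores missing roots returns fewer files
        # than the glob saw, and reports no error.
        lenient = list_leaf_files(first_level, ignore_missing=True)

        # A strict listing surfaces the deletion instead.
        try:
            list_leaf_files(first_level, ignore_missing=False)
            strict_failed = False
        except FileNotFoundError:
            strict_failed = True
        return len(first_level), len(lenient), strict_failed
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

Here the lenient listing covers only one of the two globbed paths, which is the kind of silent gap the comment is worried about.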
   
   However, if you actually _do_ delete underlying data on purpose, then an explicit `REFRESH TABLE` is supposed to allow you to query the remaining data. This can create some interesting behavioral inconsistencies: for example, attempting to initially create a table from a non-existent root path fails loudly with a "path not found" exception, but if you create a table from an existing root path, delete the path, and run `REFRESH TABLE`, then you'll have an empty table.
   
   Given these existing behaviors, it's somewhat tricky to fix the "root path throws FileNotFoundException" case without breaking existing behaviors.
   
   However, consider a workload which never does `REFRESH TABLE`: presumably every one of the `rootPaths` existed when the `InMemoryFileIndex` was initially constructed, so it should be fine to fail fast during initial construction for non-existent paths but then ignore non-existence at the root during refresh!
   
   I'm going to give that a try now.
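
The idea of failing fast at construction but tolerating missing roots on refresh might look roughly like the following. This is a sketch under stated assumptions, not the change made in the PR: `SketchFileIndex`, its `fail_on_missing_root` flag, and `demo_refresh` are all hypothetical names, and the local filesystem stands in for the Hadoop `FileSystem` that `InMemoryFileIndex` actually lists.

```python
import shutil
import tempfile
from pathlib import Path


class SketchFileIndex:
    """Hypothetical stand-in for a file index; not Spark's InMemoryFileIndex."""

    def __init__(self, root_paths):
        self.root_paths = [Path(p) for p in root_paths]
        # Initial construction: a missing root path is an error.
        self.files = self._list(fail_on_missing_root=True)

    def refresh(self):
        # Refresh: a root that disappeared since construction is skipped.
        self.files = self._list(fail_on_missing_root=False)

    def _list(self, fail_on_missing_root):
        found = []
        for root in self.root_paths:
            try:
                found.extend(sorted(p for p in root.iterdir() if p.is_file()))
            except FileNotFoundError:
                if fail_on_missing_root:
                    raise
        return found


def demo_refresh():
    base = Path(tempfile.mkdtemp())
    try:
        # Constructing over a non-existent root fails loudly.
        try:
            SketchFileIndex([base / "missing"])
            construct_failed = False
        except FileNotFoundError:
            construct_failed = True

        # Constructing over an existing root, deleting it, then refreshing
        # leaves an empty index instead of raising.
        table = base / "table"
        table.mkdir()
        (table / "data.txt").write_text("row")
        index = SketchFileIndex([table])
        files_before = len(index.files)
        shutil.rmtree(table)
        index.refresh()
        return construct_failed, files_before, len(index.files)
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

Splitting the two modes this way keeps the loud "path not found" failure at creation time while matching the empty-table outcome after a deliberate delete and refresh.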
