Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/07 02:53:15 UTC

[GitHub] [spark] JoshRosen commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

JoshRosen commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1148136217

   Does checking for `Filesystem closed` exceptions completely fix this issue, or are we still vulnerable to race conditions? Skimming through the [Hadoop DFSClient code](https://github.com/apache/hadoop/blame/2dfa928a201560bc739ff5e4fe29f8bb19188927/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java), it looks like it only throws `Filesystem closed` IOExceptions from the `checkOpen()` call, but what happens if the FS is closed after a thread has already passed `checkOpen()` and is in the middle of some other operation? In that case we might get a different IOException, but we would presumably want to treat it the same way (because it is a side effect of another task being interrupted, not a side effect of a corrupt file).
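   
   For concreteness, a message-based check along the lines being discussed might look like the sketch below (the helper is hypothetical, not part of this PR), and it only covers the `checkOpen()` path:
   
   ```scala
   import java.io.IOException
   
   // Illustrative helper (hypothetical, not the actual patch): detect the
   // "Filesystem closed" IOException that DFSClient.checkOpen() throws.
   object FsClosedCheck {
     def isFileSystemClosed(e: Throwable): Boolean = e match {
       case ioe: IOException =>
         // Only matches the checkOpen() path; a FileSystem closed mid-read can
         // surface as a different IOException and would slip past this check.
         ioe.getMessage != null && ioe.getMessage.contains("Filesystem closed")
       case _ => false
     }
   }
   ```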
   
   It seems like the core problem is that we have an overly broad `catch` block: it catches genuine corrupt-file errors, but it also catches many other kinds of failure. For example, say we get an IOException caused by a transient network connectivity issue to external storage: that doesn't meet any intuitive definition of "corrupt file", but it would still get caught in the dragnet of the current exception handler.
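   
   To make that concrete, the handler being described has roughly this shape (a simplified, self-contained sketch; `ignoreCorruptFiles` and `readFunc` here are stand-ins for the real FileScanRDD internals, not the actual code):
   
   ```scala
   import java.io.IOException
   
   // Simplified sketch of the overly broad handling described above; the names
   // are placeholders, not the real FileScanRDD code.
   object BroadCatchSketch {
     def readIgnoringCorruption[T](ignoreCorruptFiles: Boolean)(readFunc: => Iterator[T]): Iterator[T] =
       try {
         readFunc
       } catch {
         // Any IOException or RuntimeException is swallowed as a "corrupt file",
         // including transient network errors against external storage.
         case e @ (_: IOException | _: RuntimeException) if ignoreCorruptFiles =>
           System.err.println(s"Skipping file treated as corrupt: $e")
           Iterator.empty
       }
   }
   ```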
   
   The current code seems biased towards "false positive" identifications of corruption (which can lead to correctness issues). If we instead wanted to bias towards false negatives (i.e. mis-identifying true corruption as an ordinary failure and therefore failing the user's job), then maybe we could have the code in `readFunc` wrap and re-throw exceptions in some sort of `CorruptFileException` wrapper, and then modify the `catch` here to ignore only that new exception. This would require changes in a number of data sources, though. The FileScanRDD code might simply lack the information needed to identify true corruption cases, so pushing part of that decision one layer lower might make sense.
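   
   As a sketch of that direction (every name here is hypothetical; `CorruptFileException` is not an existing Spark class): data sources would wrap only errors they can positively attribute to corruption, and the handler would then narrow its catch accordingly:
   
   ```scala
   import java.io.IOException
   
   // Hypothetical wrapper type, not an existing Spark API.
   class CorruptFileException(path: String, cause: Throwable)
     extends IOException(s"Corrupt file: $path", cause)
   
   object NarrowCatchSketch {
     // readFunc would throw CorruptFileException only when it can positively
     // attribute the failure to corruption (bad magic bytes, failed checksum,
     // truncated footer, ...).
     def readIgnoringCorruption[T](ignoreCorruptFiles: Boolean)(readFunc: => Iterator[T]): Iterator[T] =
       try {
         readFunc
       } catch {
         // Everything else (closed FileSystem, transient network errors, ...)
         // now propagates and fails the task instead of being silently dropped.
         case _: CorruptFileException if ignoreCorruptFiles =>
           Iterator.empty
       }
   }
   ```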
   
   I think that could be a breaking change for certain users, though, so we'd need to treat it as a user-facing change and document it appropriately (and we might need to add escape-hatch flags in case users need to revert to the old, arguably buggy behavior).
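   
   If we went that way, the escape hatch could be a legacy flag along these lines (a sketch only, using the usual `SQLConf.buildConf` builder; the flag name, wording, and default are invented for illustration, and the definition would live inside `SQLConf` next to the other entries):
   
   ```scala
   // Hypothetical legacy flag; the name and default are illustrative only.
   val LEGACY_BROAD_CORRUPT_FILE_HANDLING =
     buildConf("spark.sql.files.legacyCorruptFileHandling")
       .doc("When true, restore the old behavior of treating any IOException or " +
         "RuntimeException from a file reader as a corrupt file when " +
         "spark.sql.files.ignoreCorruptFiles is enabled.")
       .booleanConf
       .createWithDefault(false)
   ```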
   
   ----
   
   I'm not sure what the right course of action is here; I just wanted to flag that there is a potentially broader issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

