Posted to issues@spark.apache.org by "Wei Guo (Jira)" <ji...@apache.org> on 2022/07/28 14:36:00 UTC

[jira] [Commented] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

    [ https://issues.apache.org/jira/browse/SPARK-39901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572502#comment-17572502 ] 

Wei Guo commented on SPARK-39901:
---------------------------------

Both `ignoreCorruptFiles` features need to be covered: the SQL one (spark.sql.files.ignoreCorruptFiles) and the RDD one (spark.files.ignoreCorruptFiles).
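
For reference, a minimal sketch of enabling both switches together (assuming a standard SparkSession; the flag values and the input path are illustrative only, not a recommendation):

    import org.apache.spark.sql.SparkSession

    // Build a session with both ignore-corrupt-files switches enabled.
    // spark.sql.files.ignoreCorruptFiles governs the SQL/DataFrame file scans;
    // spark.files.ignoreCorruptFiles governs the RDD-based Hadoop input paths.
    val spark = SparkSession.builder()
      .appName("ignore-corrupt-files-demo")
      .config("spark.sql.files.ignoreCorruptFiles", "true")  // SQL scenario
      .config("spark.files.ignoreCorruptFiles", "true")      // RDD scenario
      .getOrCreate()

    // With the SQL flag on, files whose read fails with what the scan treats
    // as a corruption error are skipped instead of failing the whole query.
    val df = spark.read.parquet("/path/to/possibly-corrupt-data")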

> Reconsider design of ignoreCorruptFiles feature
> -----------------------------------------------
>
>                 Key: SPARK-39901
>                 URL: https://issues.apache.org/jira/browse/SPARK-39901
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> I'm filing this ticket as a follow-up to the discussion at [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] regarding the `ignoreCorruptFiles` feature: the current implementation is biased towards considering a broad range of IOExceptions to be corruption, but this is likely overly broad and might misidentify transient errors as corruption (causing non-corrupt data to be erroneously discarded).
> SPARK-39389 fixes one instance of that problem, but we are still vulnerable to similar issues because of the overall design of this feature.
> I think we should reconsider the design of this feature: maybe we should switch the default behavior so that only an explicit allowlist of known corruption exceptions can cause files to be skipped. This could be done by involving other parts of the code, e.g. rewrapping exceptions into a `CorruptFileException` so that higher layers can positively identify corruption.
> Any changes to behavior here could potentially impact users' jobs, so we'd need to think carefully about when we want to make the change (in a 3.x release? 4.x?) and how we want to provide escape hatches (e.g. configs to revert to the old behavior). 
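
To make the allowlist/rewrapping proposal above concrete, here is a rough illustrative sketch, not Spark's actual implementation: `CorruptFileException` is the hypothetical exception type named in the proposal, and `CorruptionAllowlist`, `isKnownCorruptionError`, `readWithCorruptionCheck`, and the listed exception classes are assumptions used only for illustration.

    import java.io.{EOFException, IOException}
    import java.util.zip.ZipException

    // Hypothetical marker exception: a file would only be skipped under
    // ignoreCorruptFiles if its failure was positively wrapped in this type.
    class CorruptFileException(path: String, cause: Throwable)
      extends IOException(s"Corrupt file detected: $path", cause)

    object CorruptionAllowlist {
      // Explicit allowlist of exception types treated as real corruption,
      // as opposed to transient I/O failures (network hiccups, timeouts, ...).
      private val knownCorruptionErrors: Seq[Class[_ <: Throwable]] =
        Seq(classOf[EOFException], classOf[ZipException])

      def isKnownCorruptionError(t: Throwable): Boolean =
        knownCorruptionErrors.exists(_.isInstance(t))
    }

    // How a file reader could use it: rewrap only allowlisted errors so that
    // higher layers skip the file on a positive match, and let everything else
    // (likely transient) propagate and fail the task as usual.
    def readWithCorruptionCheck[T](path: String)(read: => T): T =
      try read catch {
        case t: Throwable if CorruptionAllowlist.isKnownCorruptionError(t) =>
          throw new CorruptFileException(path, t)
      }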



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org