Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/28 01:54:07 UTC

[GitHub] [spark] yaooqinn opened a new pull request, #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

yaooqinn opened a new pull request, #38024:
URL: https://github.com/apache/spark/pull/38024

   
   ### What changes were proposed in this pull request?
   
   Let's take a look at the case below: the left and the right sides are visiting the same table and the same partitions, and both have ignoreCorruptFiles=true. The right side shows a task skipping part of the data it reads because it encountered 'corrupt data', while the left side reads the same file correctly. Because ignoreCorruptFiles matches coarsely on RuntimeException and IOException, those exceptions do not always indicate real data corruption.
   
   ![image](https://user-images.githubusercontent.com/8326978/192667546-30d20739-a322-4618-8fb7-b0fa24301bcc.png)
   
   What's worse, such tasks are always marked as successful on the web UI, so the same query visiting the same snapshot of data can silently return inconsistent results.
   
   In this PR, we make ignoreCorruptFiles work together with the task attempt number: only a sufficiently retried attempt will ignore the possibly corrupted file. Since users may want fewer retries to avoid performance regressions, a new ignoreCorruptFilesAfterRetries configuration is introduced, which can be set lower than `spark.task.maxFailures`.
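   
   To illustrate, here is a minimal sketch of the gating logic described above (the names follow this description; using `TaskContext.attemptNumber()` as the attempt counter is an assumption, not necessarily the exact patch):
   
   ```scala
   import org.apache.spark.TaskContext
   
   // Sketch: treat an IOException/RuntimeException as an ignorable "corrupt file"
   // only once the task has been retried enough times. Earlier attempts rethrow,
   // so the task fails and is retried in case the error was transient.
   def shouldSkipCorruptFiles(
       ignoreCorruptFiles: Boolean,
       ignoreCorruptFilesAfterRetries: Int): Boolean = {
     val ctx = TaskContext.get()
     ignoreCorruptFiles && ctx != null &&
       ctx.attemptNumber() >= ignoreCorruptFilesAfterRetries
   }
   ```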
   
   ### Why are the changes needed?
   
   Fix data loss.
   
   Also, the UI now shows failed task attempts for both genuine and spurious data corruption, which helps with bug hunting.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's a bug fix (apart from the possible UI change described above).
   
   ### How was this patch tested?
   
   
   Tested locally, plus the existing tests for ignoreCorruptFiles.
   


[GitHub] [spark] srowen commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r982356825


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1078,6 +1078,13 @@ package object config {
     .booleanConf
     .createWithDefault(false)
 
+  private[spark] val IGNORE_CORRUPT_FILES_AFTER_RETRIES =

Review Comment:
   Do we really need another config?
   I'm sorta confused: isn't the file effectively 'corrupt' in this case, and haven't you asked to ignore that?



[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r983005641


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   In the PR description, I have shown a case where the same query reads the same copy of data at the same time: one read hits 'corrupt' data while the other succeeds. It means that some of the errors we catch as IOException are not genuine data/file corruption. If we ignore them, IMHO, it is a correctness issue. And the config ignoreCorruptFiles says that it ignores corrupt files, not arbitrary errors. For the case where the data really does contain plenty of corrupt files, meaning the error is unrecoverable, I added the RETRIES conf, separate from `spark.task.maxFailures`, to limit the retries.
   
   BTW, I have checked the ORC read path; it seems to wrap all errors in IOException.
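   
   For reference, such a config entry might look roughly like the following in `core/src/main/scala/org/apache/spark/internal/config/package.scala` (a sketch only; the key string, default value, and doc text are assumptions, not the exact patch):
   
   ```scala
   private[spark] val IGNORE_CORRUPT_FILES_AFTER_RETRIES =
     ConfigBuilder("spark.files.ignoreCorruptFilesAfterRetries")  // key name assumed
       .doc("Number of task attempts after which an IOException or RuntimeException " +
         "thrown while reading a file is treated as corruption and ignored, " +
         "when ignoreCorruptFiles is enabled.")
       .intConf
       .createWithDefault(0)  // default assumed
   ```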



[GitHub] [spark] mridulm commented on pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #38024:
URL: https://github.com/apache/spark/pull/38024#issuecomment-1260859431

   I am a bit confused with what this PR is trying to do.
   If we want to ignore corrupt files, then by definition failures will be ignored and tasks will be marked successful, because that is what the config is for.


[GitHub] [spark] wayneguow commented on a diff in pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
wayneguow commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r981991786


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   If the network is unreachable or the file storage is temporarily broken for a period of time, a task that has been retried beyond TASK_MAX_FAILURES would still cause data loss.



[GitHub] [spark] github-actions[bot] closed pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles
URL: https://github.com/apache/spark/pull/38024


[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r982991812


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1078,6 +1078,13 @@ package object config {
     .booleanConf
     .createWithDefault(false)
 
+  private[spark] val IGNORE_CORRUPT_FILES_AFTER_RETRIES =

Review Comment:
   > isn't the file effectively 'corrupt' in this case.
   
   It isn't; an IOException could be a temporary failure.



[GitHub] [spark] yaooqinn commented on pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on PR #38024:
URL: https://github.com/apache/spark/pull/38024#issuecomment-1260291484

   cc @cloud-fan @dongjoon-hyun @HyukjinKwon @wangyum thanks.


[GitHub] [spark] srowen commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r983008731


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   Well, some kind of file corruption is just one reason for an error, and surely the config is meant to denote any inability to correctly read data. You're really distinguishing between transient and permanent failures, but, IOException doesn't tell you that.
   
   It's not a correctness issue, as you've already said you're willing to ignore data that can't be read (otherwise, don't specify this, right?). Now that's being replaced with "data that can't be read if I try N times" which isn't fundamentally different. 
   
   Or I could spin a different argument: if the data can't be read 2 times, and is read a 3rd time, are you OK with that being correct?



[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r982003775


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   Yes, but compared with the behavior before this PR, this case is now exposed to users instead of tasks being silently marked as successful.



[GitHub] [spark] github-actions[bot] commented on pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #38024:
URL: https://github.com/apache/spark/pull/38024#issuecomment-1382601668

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


[GitHub] [spark] LuciferYang commented on a diff in pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r981889549


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -253,7 +259,7 @@ class FileScanRDD(
                     null
                   // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
                   case e: FileNotFoundException if !ignoreMissingFiles => throw e
-                  case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
+                  case e @ (_: RuntimeException | _: IOException) if shouldSkipCorruptFiles() =>

Review Comment:
   `HadoopRDD` and `NewHadoopRDD` also use `ignoreCorruptFiles`, although they are not SQL APIs. Do they need similar changes?
   
   



[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r981911877


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -253,7 +259,7 @@ class FileScanRDD(
                     null
                   // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
                   case e: FileNotFoundException if !ignoreMissingFiles => throw e
-                  case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
+                  case e @ (_: RuntimeException | _: IOException) if shouldSkipCorruptFiles() =>

Review Comment:
   Yes, I planned to fix them in another PR because they are in different modules, but I am OK with fixing them together.
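   
   For context, the guarded read in the `FileScanRDD` hunk above boils down to a pattern like the following, which is presumably what would be replicated in those RDDs (a simplified sketch; `readOrSkip` is a hypothetical helper, not Spark API):
   
   ```scala
   import java.io.{FileNotFoundException, IOException}
   
   // Hypothetical helper mirroring the quoted catch clause: errors are swallowed
   // only when the attempt-aware guard allows treating the file as corrupt;
   // otherwise they propagate, so the task fails and gets retried.
   def readOrSkip[T](
       read: () => T,
       ignoreMissingFiles: Boolean,
       shouldSkipCorruptFiles: () => Boolean): Option[T] = {
     try Some(read()) catch {
       // A missing file still surfaces unless ignoreMissingFiles is set.
       case e: FileNotFoundException if !ignoreMissingFiles => throw e
       // Skip the rest of the corrupted file only when the guard says so.
       case _: RuntimeException | _: IOException if shouldSkipCorruptFiles() => None
     }
   }
   ```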



[GitHub] [spark] srowen commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r982994356


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   Are you effectively trying to get it to fail and retry on an IOE, rather than not retry because it says, well, I'm ignoring errors anyway? OK. I'm just wondering how many cases that hurts too, where you retry an unrecoverable error longer.



[GitHub] [spark] yaooqinn commented on a diff in pull request #38024: [SPARK-40591][CORE][SQL] Fix data loss caused by ignoreCorruptFiles

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on code in PR #38024:
URL: https://github.com/apache/spark/pull/38024#discussion_r983019509


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:
##########
@@ -36,8 +36,15 @@ class FilePartitionReader[T](
 
   private def ignoreMissingFiles = options.ignoreMissingFiles
   private def ignoreCorruptFiles = options.ignoreCorruptFiles
+  private def ignoreCorruptFilesAfterRetries = options.ignoreCorruptFilesAfterRetries
 
   override def next(): Boolean = {
+
+    def shouldSkipCorruptFiles(): Boolean = {

Review Comment:
   > [DOCS] Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
   
   FYI, the doc says that it ignores corrupt files, so we are willing to ignore corrupt files, right? For a transient failure like a network issue, we want to retry, because the file is not corrupt. Anyway, as you said, we cannot distinguish whether a failure is transient or permanent, so in this PR I retry both, with a limit.
   
   > Or I could spin a different argument: if the data can't be read 2 times, and is read a 3rd time, are you OK with that being correct?
   
   For this case, with this PR, we can set ignoreCorruptFilesAfterRetries greater than the maximum task failures and let it `fail` after retrying enough times. Or maybe we should rename it to `determineFileCorruptAfterRetries`.
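   
   As a usage sketch (the exact conf key for the new option is assumed; the other two keys are existing Spark confs):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical setup: retry a failing read twice before treating the file as
   // corrupt and skipping it, while a task may still be attempted up to 4 times.
   val spark = SparkSession.builder()
     .config("spark.sql.files.ignoreCorruptFiles", "true")
     .config("spark.sql.files.ignoreCorruptFilesAfterRetries", "2")  // assumed key
     .config("spark.task.maxFailures", "4")
     .getOrCreate()
   ```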


