Posted to issues@spark.apache.org by "Willi Raschkowski (Jira)" <ji...@apache.org> on 2022/02/09 12:54:00 UTC

[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

    [ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ] 

Willi Raschkowski commented on SPARK-38166:
-------------------------------------------

Attaching driver logs:  [^driver.log]

Notable lines are probably:
{code:java}
...
INFO  [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
INFO  [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218)
INFO  [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
        ...
Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        ...

INFO  [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure
INFO  [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages
...
{code}
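For context, here's a rough sketch of one way to check the written output for duplicate tracking numbers (the input path and session setup are placeholders; only the column name is taken from the snippet in the description below):

{code:python}
# Illustrative only: read back whatever compute_shipments wrote and compare
# total rows against distinct ship_trck_num values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

out = spark.read.parquet("/path/to/output")  # placeholder path

total_rows = out.count()
distinct_keys = out.select("ship_trck_num").distinct().count()

# After dropDuplicates(["ship_trck_num"]) these should match; a gap means
# duplicates were re-introduced, which is what we're observing.
print(f"rows={total_rows}, distinct ship_trck_num={distinct_keys}")
{code}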

> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
>                 Key: SPARK-38166
>                 URL: https://issues.apache.org/jira/browse/SPARK-38166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>         Environment: Cluster runs on K8s. AQE is enabled.
>            Reporter: Willi Raschkowski
>            Priority: Major
>              Labels: correctness
>         Attachments: driver.log
>
>
> We're seeing duplicates after running the following 
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to reproduce it outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org