Posted to issues@spark.apache.org by "Willi Raschkowski (Jira)" <ji...@apache.org> on 2022/02/09 12:54:00 UTC
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ]
Willi Raschkowski commented on SPARK-38166:
-------------------------------------------
Attaching driver logs: [^driver.log]
Notable lines are probably:
{code:java}
...
INFO [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
INFO [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218)
INFO [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
...
Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
...
INFO [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure
INFO [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages
...
{code}
> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
> Key: SPARK-38166
> URL: https://issues.apache.org/jira/browse/SPARK-38166
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.2
> Environment: Cluster runs on K8s. AQE is enabled.
> Reporter: Willi Raschkowski
> Priority: Major
> Labels: correctness
> Attachments: driver.log
>
>
> We're seeing duplicates after running the following
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to reproduce it outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas.
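For context on how a retried task could yield duplicates here: a column-less {{repartition(4)}} distributes rows round-robin, i.e. by each row's position in the task's input iterator rather than by row content. The sketch below is only an illustrative model of that idea, not Spark's actual implementation; the {{round_robin}} helper and the sample rows are hypothetical:

{code:python}
def round_robin(rows, n):
    """Toy model of round-robin repartitioning: row i goes to bucket i % n.

    Placement depends only on a row's *position* in the input, not its value,
    so the same multiset of rows in a different order fills the buckets
    differently.
    """
    buckets = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        buckets[i % n].append(row)
    return buckets

# First attempt of a task sees rows in one order...
first = round_robin(["a", "b", "c", "d"], 2)   # [["a", "c"], ["b", "d"]]

# ...a retry sees the same rows in a different order (e.g. if the upstream
# dropDuplicates output is not deterministic across attempts).
retry = round_robin(["b", "a", "c", "d"], 2)   # [["b", "c"], ["a", "d"]]
{code}

Under this model, if the first attempt's output for partition 0 was already consumed and a retry recomputes partition 1, row "a" appears in both - consistent with seeing duplicates only after lost executors and task retries in the repartition stage.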
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org