You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Chandni Singh (Jira)" <ji...@apache.org> on 2021/03/23 16:07:00 UTC
[jira] [Created] (SPARK-34840) Fix cases of corruption in merged
shuffle blocks that are pushed
Chandni Singh created SPARK-34840:
-------------------------------------
Summary: Fix cases of corruption in merged shuffle blocks that are pushed
Key: SPARK-34840
URL: https://issues.apache.org/jira/browse/SPARK-34840
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 3.1.0
Reporter: Chandni Singh
The {{RemoteBlockPushResolver}} which handles the shuffle push blocks and merges them was introduced in [#30062|https://github.com/apache/spark/pull/30062]. We have identified 2 scenarios where the merged blocks get corrupted:
# {{StreamCallback.onFailure()}} is called more than once. Initially we assumed that the onFailure callback will be called just once per stream. However, we observed that this is called twice when a client connection is reset. When the client connection is reset then there are 2 events that get triggered in this order.
* {{exceptionCaught}}. This event is propagated to {{StreamInterceptor}}. {{StreamInterceptor.exceptionCaught()}} invokes {{callback.onFailure(streamId, cause)}}. This is the first time StreamCallback.onFailure() will be invoked.
* {{channelInactive}}. Since the channel closes, the {{channelInactive}} event gets triggered which again is propagated to {{StreamInterceptor}}. {{StreamInterceptor.channelInactive()}} invokes {{callback.onFailure(streamId, new ClosedChannelException())}}. This is the second time StreamCallback.onFailure() will be invoked.
# The flag {{isWriting}} is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails.
Also adding additional changes that improve the code.
# Using positional writes all the time because this simplifies the code and with microbenchmarking haven't seen any performance impact.
# Additional minor changes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org