You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Chandni Singh (Jira)" <ji...@apache.org> on 2021/03/23 16:07:00 UTC

[jira] [Created] (SPARK-34840) Fix cases of corruption in merged shuffle blocks that are pushed

Chandni Singh created SPARK-34840:
-------------------------------------

             Summary: Fix cases of corruption in merged shuffle blocks that are pushed
                 Key: SPARK-34840
                 URL: https://issues.apache.org/jira/browse/SPARK-34840
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 3.1.0
            Reporter: Chandni Singh


The {{RemoteBlockPushResolver}} which handles the shuffle push blocks and merges them was introduced in [#30062|https://github.com/apache/spark/pull/30062]. We have identified 2 scenarios where the merged blocks get corrupted:
 # {{StreamCallback.onFailure()}} is called more than once. Initially we assumed that the onFailure callback will be called just once per stream. However, we observed that this is called twice when a client connection is reset. When the client connection is reset then there are 2 events that get triggered in this order.

 * {{exceptionCaught}}. This event is propagated to {{StreamInterceptor}}. {{StreamInterceptor.exceptionCaught()}} invokes {{callback.onFailure(streamId, cause)}}. This is the first time StreamCallback.onFailure() will be invoked.
 * {{channelInactive}}. Since the channel closes, the {{channelInactive}} event gets triggered which again is propagated to {{StreamInterceptor}}. {{StreamInterceptor.channelInactive()}} invokes {{callback.onFailure(streamId, new ClosedChannelException())}}. This is the second time StreamCallback.onFailure() will be invoked.

 # The flag {{isWriting}} is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails.

Also adding additional changes that improve the code.
 # Using positional writes all the time because this simplifies the code and with microbenchmarking haven't seen any performance impact.
 # Additional minor changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org