Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/04/29 17:28:12 UTC

[jira] [Commented] (TEZ-3237) Corrupted shuffle transfers to disk are not detected during transfer

    [ https://issues.apache.org/jira/browse/TEZ-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264191#comment-15264191 ] 

Jason Lowe commented on TEZ-3237:
---------------------------------

Corruption in transfers to memory is detected because the data is decompressed and checksum-verified as it streams to memory.  One workaround would be to decompress the data _only for checksum verification_ while writing the data to disk.  Then we would be able to detect bad data during the transfer instead of later when we try to read it back from disk.
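
As a rough illustration of that workaround, here is a minimal sketch.  It is not the Tez fetcher code and does not use the real shuffle/IFile format: plain GZIP (whose trailer holds a CRC32 of the uncompressed data) stands in for the compressed, checksummed shuffle payload, and the names VerifyingDiskCopy, TeeInputStream and copyAndVerify are made up for this example.  The raw bytes go to disk unchanged while a decompressor runs over the same bytes purely to trigger the checksum check.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: GZIP stands in for the real compressed shuffle format.
final class VerifyingDiskCopy {

    /** Passes reads through to the source and copies every byte read to a sink. */
    private static final class TeeInputStream extends FilterInputStream {
        private final OutputStream sink;
        long copied;

        TeeInputStream(InputStream in, OutputStream sink) {
            super(in);
            this.sink = sink;
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                sink.write(b);
                copied++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                sink.write(buf, off, n);
                copied += n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            // Read rather than skip, so skipped bytes still reach the sink.
            if (n <= 0) {
                return 0;
            }
            byte[] tmp = new byte[(int) Math.min(n, 4096)];
            int r = read(tmp, 0, tmp.length);
            return r < 0 ? 0 : r;
        }

        @Override
        public boolean markSupported() {
            return false;
        }
    }

    /**
     * Writes the raw (still compressed) bytes from source to disk and, in the
     * same pass, decompresses them only to verify the checksum.  Returns the
     * number of raw bytes written; throws IOException on corrupt data.
     */
    static long copyAndVerify(InputStream source, OutputStream disk) throws IOException {
        TeeInputStream tee = new TeeInputStream(source, disk);
        byte[] scratch = new byte[8192];
        GZIPInputStream verifier = new GZIPInputStream(tee);
        while (verifier.read(scratch) != -1) {
            // Decompressed bytes are discarded; only the checksum check matters.
        }
        // Forward any trailing bytes the decompressor did not consume.
        while (tee.read(scratch, 0, scratch.length) != -1) {
            // The tee copies them to disk as a side effect of reading.
        }
        disk.flush();
        return tee.copied;
    }
}
```

With something along these lines the fetcher would see an IOException while the transfer is still in progress, so the failure can still be attributed to the source.  The cost is the extra CPU spent decompressing data that is thrown away.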

A preferable solution would be to avoid decompressing the data and simply checksum the raw bits being transferred, but the MapReduce shuffle protocol Tez is reusing doesn't support that.  If we ever get around to a Tez-specific shuffle protocol, then better validation of the raw data being transferred would be a nice addition.
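
For comparison, checksumming the raw bits could look roughly like the sketch below.  This is purely hypothetical: the current protocol carries no checksum of the raw bytes, so the expectedCrc parameter assumes a future protocol in which the sender supplies one, and RawChecksumCopy and copyWithRawCrc are invented names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical sketch: assumes the sender provides a CRC32 of the raw bytes.
final class RawChecksumCopy {

    /**
     * Copies raw bytes from source to disk while computing a CRC32 over them,
     * then compares the result with the value the sender is assumed to supply.
     * Returns the number of bytes written; throws IOException on a mismatch.
     */
    static long copyWithRawCrc(InputStream source, OutputStream disk, long expectedCrc)
            throws IOException {
        CheckedInputStream checked = new CheckedInputStream(source, new CRC32());
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = checked.read(buf)) != -1) {
            disk.write(buf, 0, n);
            total += n;
        }
        disk.flush();
        long actual = checked.getChecksum().getValue();
        if (actual != expectedCrc) {
            throw new IOException("Raw checksum mismatch: expected " + expectedCrc
                + " but computed " + actual);
        }
        return total;
    }
}
```

No decompression is needed here, which is why this would be cheaper than the workaround above if the protocol supported it.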

> Corrupted shuffle transfers to disk are not detected during transfer
> --------------------------------------------------------------------
>
>                 Key: TEZ-3237
>                 URL: https://issues.apache.org/jira/browse/TEZ-3237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> When a shuffle transfer is larger than the single transfer limit, it gets written straight to disk during the transfer.  Unfortunately, no checksum validation is performed during that transfer, so if the data is corrupted at the source or in transit, the corruption goes undetected.  The error is detected only later, when the task tries to consume the transferred data, and at that point it's too late to blame the source task for the error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)