Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/04/10 06:42:15 UTC

[jira] [Commented] (TEZ-1034) Shuffling can sometimes hang with duplicate inputs for the same index

    [ https://issues.apache.org/jira/browse/TEZ-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964992#comment-13964992 ] 

Bikas Saha commented on TEZ-1034:
---------------------------------

Attempt 1 succeeds
Attempt 1 is then killed for some reason, say because its node got blacklisted
Attempt 2 succeeds

The shuffler gets 2 data movement events for the same input index, from 2 different hosts, so the deduping check does not catch the duplicate. The shuffler allocates memory for both fetches. The second fetch uses its memory but never frees it, because the accounting logic ignores duplicate fetches. This is bug 1, fixed in the shuffler.

The un-freed memory accumulates in the accounting until the memory counted as in use exceeds the threshold beyond which no further memory is allocated. Memory can be freed when some of it is merged to disk, but that is triggered by a different threshold, and if that threshold is not met then nothing is freed. The shuffler then deadlocks and hangs. Fetch to disk does not trigger either if the size of each individual fetch is small.
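
To make the leak concrete, here is a minimal sketch of the accounting problem described above. It is not the Tez shuffler code: the class ShuffleMemorySketch, its methods, and the byte sizes are all hypothetical, and the merge-to-disk path is left out. It only shows how ignoring a duplicate fetch without releasing its reservation can stall later reservations, and how releasing on duplicate avoids that.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of shuffle memory accounting (not Tez code).
final class ShuffleMemorySketch {
    private final long limitBytes;   // threshold beyond which no memory is handed out
    private long usedBytes = 0;      // bytes the accounting believes are in use
    private final Set<Integer> completedInputs = new HashSet<>();

    ShuffleMemorySketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserve memory before a fetch. Returns false when the limit would be
    // exceeded, which in this model means the fetcher has to wait.
    synchronized boolean reserve(long bytes) {
        if (usedBytes + bytes > limitBytes) {
            return false;
        }
        usedBytes += bytes;
        return true;
    }

    // Buggy variant: a duplicate input index is ignored, but the bytes
    // reserved for it are never given back.
    synchronized void commitBuggy(int inputIndex, long bytes) {
        if (!completedInputs.add(inputIndex)) {
            return; // duplicate dropped, reservation leaked
        }
        // First copy: data stays in memory until merged (merge not modeled).
    }

    // Fixed variant: a duplicate input index releases its reservation.
    synchronized void commitFixed(int inputIndex, long bytes) {
        if (!completedInputs.add(inputIndex)) {
            usedBytes -= bytes; // give the duplicate's memory back
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    public static void main(String[] args) {
        // Input index 0 arrives twice (attempt 1 and attempt 2, different hosts).
        ShuffleMemorySketch buggy = new ShuffleMemorySketch(100);
        buggy.reserve(40);
        buggy.commitBuggy(0, 40);
        buggy.reserve(40);
        buggy.commitBuggy(0, 40);
        System.out.println("buggy used=" + buggy.usedBytes()
                + " nextReserveOk=" + buggy.reserve(40));

        ShuffleMemorySketch fixed = new ShuffleMemorySketch(100);
        fixed.reserve(40);
        fixed.commitFixed(0, 40);
        fixed.reserve(40);
        fixed.commitFixed(0, 40);
        System.out.println("fixed used=" + fixed.usedBytes()
                + " nextReserveOk=" + fixed.reserve(40));
    }
}
```

In the buggy variant the duplicate's 40 bytes stay counted forever, so the next reservation is refused even though only one copy of the input is actually held. With small fetches and a merge threshold that is never reached, nothing in this model would ever bring the count back down.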

I am also changing the TaskAttempt logic to send input failed events when a successful task attempt is killed. This is not correct in general, because the outputs of the task may not be harmed by the TA being killed. Currently, though, this happens on node failure, and our intermediate outputs are on those nodes, so this is ok. That said, in the case I was looking at, the intermediate data from the failed node was also fine to read.

[~hitesh] Can you please review/commit? Unit test added and existing tests pass. Thanks!


> Shuffling can sometimes hang with duplicate inputs for the same index
> ---------------------------------------------------------------------
>
>                 Key: TEZ-1034
>                 URL: https://issues.apache.org/jira/browse/TEZ-1034
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-1034.1.patch, TEZ-1034.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)