Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2014/03/07 23:29:43 UTC

[jira] [Updated] (TEZ-918) Shuffle can hang if there are intermittent fetch failures

     [ https://issues.apache.org/jira/browse/TEZ-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-918:
-------------------------------

    Attachment: TEZ-918.1.txt

Initial patch.
Reverts Shuffle and fetch behaviour to what it was previously. Adds a condition in the AM to fail an Input based on the absolute number of failures, rather than on ratios alone.
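
For illustration only, here is a minimal sketch of the kind of check described above. It is not taken from the attached patch: the class name, method name, and thresholds are all hypothetical. The assumption is that the AM counts fetch-failure reports against a producer's output and fails that Input either when the fraction of distinct failing consumers crosses a ratio, or when the raw number of reports crosses an absolute limit.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch, not the code in TEZ-918.1.txt.
 * Tracks fetch-failure reports against one producer output.
 */
final class FetchFailureTracker {
    private final int totalConsumers;          // consumers reading this output
    private final double maxFailureFraction;   // assumed ratio threshold
    private final int maxAbsoluteFailures;     // assumed absolute threshold

    private final Set<Integer> failedConsumers = new HashSet<>();
    private int failureReports = 0;

    FetchFailureTracker(int totalConsumers, double maxFailureFraction, int maxAbsoluteFailures) {
        if (totalConsumers <= 0) {
            throw new IllegalArgumentException("totalConsumers must be positive");
        }
        this.totalConsumers = totalConsumers;
        this.maxFailureFraction = maxFailureFraction;
        this.maxAbsoluteFailures = maxAbsoluteFailures;
    }

    /**
     * Records one fetch-failure report from the given consumer.
     * Returns true when the Input should be failed.
     */
    boolean reportFailure(int consumerId) {
        failureReports++;
        failedConsumers.add(consumerId);

        // Ratio check: fraction of distinct consumers that reported a failure.
        double fraction = (double) failedConsumers.size() / totalConsumers;
        boolean ratioExceeded = fraction >= maxFailureFraction;

        // Absolute check: total number of reports, regardless of who sent them.
        boolean absoluteExceeded = failureReports >= maxAbsoluteFailures;

        return ratioExceeded || absoluteExceeded;
    }
}
```

Under these assumptions, a single consumer out of 100 that keeps failing to fetch would never reach a 25% ratio on its own, but its third report would trip an absolute limit of 3, so it would not be left waiting indefinitely on the ratio check alone.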

There's no unit test included. I'll try adding one if possible; otherwise I'll open a separate jira for it. Meanwhile, I'm testing it locally and with help from [~tassapola].

[~hitesh], could you please take a look.

> Shuffle can hang if there are intermittent fetch failures
> ---------------------------------------------------------
>
>                 Key: TEZ-918
>                 URL: https://issues.apache.org/jira/browse/TEZ-918
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>            Priority: Critical
>         Attachments: TEZ-918.1.txt
>
>
> Post TEZ-902 - if there's a fetch failure, the task could end up hanging - waiting for the specific Input.
> Had spoken to [~bikassaha] about this offline while looking at TEZ-902 - another similar issue already exists under the fault tolerance jira, but that occurs rarely and under specific circumstances.
> Will try fixing this tomorrow, otherwise may revert TEZ-902.


