Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2021/08/27 11:43:00 UTC

[jira] [Commented] (TEZ-4139) Tez should consider node information for computing failure fraction

    [ https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405759#comment-17405759 ] 

László Bodor commented on TEZ-4139:
-----------------------------------

[~rajesh.balamohan]: I'm about to refresh this patch again, and I hope I can post a new version soon.
I checked the conversation: it seems we're about to consider downstream hosts, but I would like to consider upstream hosts too. Recently I've faced shuffle issues where lots of read errors happen due to a single node failure, and even if the mapper task is marked as OUTPUT_LOST, task attempts fail because of the bumped-up failure fraction.

I would like to handle both upstream and downstream hosts; please let me know if that doesn't make sense.

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When lots of downstream attempts fail to pull information from a source task, the source task is marked as failed and retried. Currently the failure fraction is computed from the unique downstream task attempts that reported a failure. However, it should also take node information into account when computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849
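To illustrate the difference described in the issue, here is a minimal Java sketch that compares a fraction based on unique downstream attempts with one based on unique downstream hosts. It is not the code in TaskAttemptImpl and not the proposed patch: the class, the ReadErrorReport type, the method names, and the host counts are all hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Tez code base.
final class FailureFractionSketch {

    /** Assumed shape of a read-error report: the failing downstream attempt and the host it ran on. */
    static final class ReadErrorReport {
        final String attemptId;
        final String host;

        ReadErrorReport(String attemptId, String host) {
            this.attemptId = attemptId;
            this.host = host;
        }
    }

    /** Fraction of downstream consumers (counted by unique attempt) that reported a read error. */
    static float attemptBasedFraction(List<ReadErrorReport> reports, int totalConsumers) {
        Set<String> attempts = new HashSet<>();
        for (ReadErrorReport r : reports) {
            attempts.add(r.attemptId);
        }
        return (float) attempts.size() / totalConsumers;
    }

    /** Fraction of downstream hosts (counted by unique host) that reported a read error. */
    static float hostBasedFraction(List<ReadErrorReport> reports, int totalHosts) {
        Set<String> hosts = new HashSet<>();
        for (ReadErrorReport r : reports) {
            hosts.add(r.host);
        }
        return (float) hosts.size() / totalHosts;
    }

    public static void main(String[] args) {
        // Four distinct downstream attempts fail, but all of them ran on the same host.
        List<ReadErrorReport> reports = Arrays.asList(
            new ReadErrorReport("attempt_1", "node-7"),
            new ReadErrorReport("attempt_2", "node-7"),
            new ReadErrorReport("attempt_3", "node-7"),
            new ReadErrorReport("attempt_4", "node-7"));

        System.out.println("attempt-based: " + attemptBasedFraction(reports, 10));
        System.out.println("host-based:    " + hostBasedFraction(reports, 5));
    }
}
```

In this made-up scenario a single misbehaving downstream node contributes four unique attempts but only one unique host, so the attempt-based fraction is higher than the host-based one. A similar grouping could in principle be applied to upstream hosts, as suggested in the comment above.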



--
This message was sent by Atlassian Jira
(v8.3.4#803005)