Posted to issues@tez.apache.org by "Kuhu Shukla (JIRA)" <ji...@apache.org> on 2018/09/06 18:26:00 UTC

[jira] [Commented] (TEZ-3972) Tez DAG can hang when a single task fails to fetch

    [ https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606216#comment-16606216 ] 

Kuhu Shukla commented on TEZ-3972:
----------------------------------

Good point [~jeagles]. If the number of running tasks is zero, I think we might want to avoid a rerun: zero running tasks indicates that the reporter vertex has in fact finished, so we can treat this input failure as stale and allow the DAG to finish. That would also save us from other possible races, which won't show up if everything succeeds. Thoughts?

> Tez DAG can hang when a single task fails to fetch
> --------------------------------------------------
>
>                 Key: TEZ-3972
>                 URL: https://issues.apache.org/jira/browse/TEZ-3972
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch
>
>
> Description of the hung DAG:
> A DAG with 2 vertices: the {{Map}} vertex has 22k map tasks and the downstream {{Reduce}} vertex has 1009 tasks. All tasks succeed except one, which hangs. This one task attempt is doing a local fetch from a node that now has a bad disk. It fails to fetch and reports the offending input attempt identifiers to the AM. However, the AM does not schedule a re-run, because the size of {{uniquefailedOutputReports}} is 1 (only this task attempt failed to fetch) and the failure fraction is therefore not met. The denominator for this fraction is the total number of tasks, so the re-run never occurs. This JIRA tracks the AM side of the change to alleviate this problem.
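
The arithmetic behind the description can be sketched as follows. This is only an illustration of the fraction check as described above, not the actual Tez AM code: the class name, method names, report identifier, and the 0.1 threshold are all assumptions made for the example. With one unique report and 1009 tasks in the denominator, the fraction is roughly 0.001, far below any plausible threshold, so no re-run would be scheduled.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; names and threshold are hypothetical,
// not taken from the Tez code base or from the attached patches.
final class FetchFailureFractionSketch {

    // Assumed threshold, chosen only for illustration.
    static final double MAX_ALLOWED_FAILURE_FRACTION = 0.1;

    /** Fraction of tasks (out of the given denominator) that reported a fetch failure. */
    static double failureFraction(Set<String> uniqueFailedOutputReports, int denominator) {
        if (denominator <= 0) {
            throw new IllegalArgumentException("denominator must be positive");
        }
        return (double) uniqueFailedOutputReports.size() / denominator;
    }

    /** A re-run is scheduled only when the fraction reaches the threshold. */
    static boolean shouldRerun(Set<String> uniqueFailedOutputReports, int totalTasks) {
        return failureFraction(uniqueFailedOutputReports, totalTasks)
                >= MAX_ALLOWED_FAILURE_FRACTION;
    }

    public static void main(String[] args) {
        Set<String> reports = new HashSet<>();
        reports.add("hung_reduce_attempt"); // hypothetical identifier of the one reporter

        int totalReduceTasks = 1009; // task count from the description
        System.out.println("fraction = " + failureFraction(reports, totalReduceTasks));
        System.out.println("rerun    = " + shouldRerun(reports, totalReduceTasks));
    }
}
```

Because the set can only ever contain this single reporter, the fraction stays fixed at 1/1009 no matter how long the attempt waits, which is consistent with the hang described above.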


