Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2016/10/18 23:31:59 UTC

[jira] [Comment Edited] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

    [ https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587023#comment-15587023 ] 

Hitesh Shah edited comment on TEZ-3479 at 10/18/16 11:31 PM:
-------------------------------------------------------------

At least for this scenario, I think we did not properly recover task_1476667862449_0031_1_07_000004 to a failed state, which leads to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}

The tracked task failure is for task_1476667862449_0031_1_07_000000, not for task_1476667862449_0031_1_07_000004.
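
As a quick sanity check on the counters in the log line above, here is a minimal sketch that parses them and shows one task is still unaccounted for. The helper names are hypothetical, and reading the counters as "completed = failed + killed + success" is an assumption based on the numbers in this one line, not on the Tez source.

```python
import re

# The Task Completion line quoted above, copied as-is.
LOG_LINE = (
    "2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: "
    "Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, "
    "killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE"
)


def parse_counters(line):
    """Return the integer key=value pairs found in a log line (hypothetical helper)."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)\b", line)}


def outstanding_tasks(counters):
    """Tasks that have not reached a terminal state (assumed: tasks - completed)."""
    return counters["tasks"] - counters["completed"]


counters = parse_counters(LOG_LINE)
terminal = counters["failed"] + counters["killed"] + counters["success"]

print("terminal tasks:", terminal)
print("reported completed:", counters["completed"])
print("outstanding tasks:", outstanding_tasks(counters))
```

With these numbers, 28 of 29 tasks are in a terminal state, so the vertex is still waiting on one task, which is consistent with the hang described above.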


was (Author: hitesh):
At least for this scenario, I think we did not properly recover task_1476667862449_0031_1_07_000004 to a failed state, which leads to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}


> DAG AM does not schedule any more containers in corner cases
> ------------------------------------------------------------
>
>                 Key: TEZ-3479
>                 URL: https://issues.apache.org/jira/browse/TEZ-3479
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.7.1
>            Reporter: Rajesh Balamohan
>         Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3 node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space available" errors on local disks (e.g., Q29 in TPC-DS). The DAG should fail after enough retries, which happens most of the time. Once in a while (roughly once in 20-30 runs), the DAG AM gets into a hung state and does not schedule any more containers for the failed task attempts. Will attach the logs shortly.


