Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2015/05/29 03:27:17 UTC

[jira] [Commented] (TEZ-2502) TezTaskRunner2 not killing tasks properly in all situations

    [ https://issues.apache.org/jira/browse/TEZ-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564045#comment-14564045 ] 

Siddharth Seth commented on TEZ-2502:
-------------------------------------

Here's what happened in this case:
- A task received an error over RPC because it had been killed in the AM.
- The error didn't trigger an abort + interrupt, so the task kept running in the daemon.
- The task was in the process of building a shared hash table.
- The hash table build will never complete: the task is dead according to the AM, so the events needed to fetch data are never received.
- New tasks come in and block on the same shared hash table.
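
The sequence above can be reproduced in miniature with plain java.util.concurrent primitives. The sketch below is only an illustration under stated assumptions, not Tez or TezTaskRunner2 code: the class name, the method name, the latch standing in for "events from the AM", and the future standing in for the shared hash table are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from Tez.
class SharedTableKillSketch {

    /** Returns what a second task waiting on the shared table observes. */
    static String waiterOutcome(boolean interruptOnKill) throws Exception {
        // Stand-in for the shared hash table that later tasks block on.
        CompletableFuture<String> sharedTable = new CompletableFuture<>();
        // Stand-in for input events; never counted down, because the AM
        // considers the task dead and sends nothing more.
        CountDownLatch inputEvents = new CountDownLatch(1);

        Thread builder = new Thread(() -> {
            try {
                inputEvents.await(); // blocks until interrupted
                sharedTable.complete("table");
            } catch (InterruptedException e) {
                // Abort path: fail the shared future so waiters do not hang.
                sharedTable.completeExceptionally(e);
            }
        });
        builder.setDaemon(true); // let the JVM exit even if the builder stays stuck
        builder.start();

        // The kill arrives as an RPC error; the question is whether the
        // runner turns it into an interrupt of the running task.
        if (interruptOnKill) {
            builder.interrupt();
        }

        // A newly arrived task waiting on the same table (bounded here only
        // so the sketch terminates).
        try {
            sharedTable.get(500, TimeUnit.MILLISECONDS);
            return "loaded";
        } catch (ExecutionException e) {
            return "failed fast";
        } catch (TimeoutException e) {
            return "still blocked";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("kill ignored: " + waiterOutcome(false));
        System.out.println("kill interrupts task: " + waiterOutcome(true));
    }
}
```

In this toy version, skipping the interrupt leaves the waiter blocked until its timeout, while interrupting the builder fails the shared future so the waiter returns promptly. Whether the real fix should fail the waiters, retry the build, or do something else is not covered here.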

> TezTaskRunner2 not killing tasks properly in all situations
> -----------------------------------------------------------
>
>                 Key: TEZ-2502
>                 URL: https://issues.apache.org/jira/browse/TEZ-2502
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> Looks exactly like all the similar issues, but this time there's no deadlock, so it's probably a different root cause. Internal app IDs application_1431919257083_3137 and application_1431919257083_3133; logs upon request. The hypothesis during discussion was that the shared hashtable loader got preempted and the tasks waiting for this hashtable run forever. It may be something else.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)