Posted to issues@tez.apache.org by "zhengchenyu (Jira)" <ji...@apache.org> on 2022/08/09 08:07:00 UTC

[jira] [Commented] (TEZ-4441) TezAppMaster may stuck because of reportError skip send error event.

    [ https://issues.apache.org/jira/browse/TEZ-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577244#comment-17577244 ] 

zhengchenyu commented on TEZ-4441:
----------------------------------

[~abstractdog] Hi, my Tez application got stuck in a YARN federation cluster.

In our cluster, getAvailableResources returned null, which caused an NPE. The AMRMClientAsyncImpl.CallbackHandlerThread then stopped, but TezAppMaster did not shut down. There are two problems here, so I filed two issues:

(1) TEZ-4440

Before YARN-8933, getAvailableResources may return null, which leads to an NPE.
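
To illustrate the kind of guard this calls for, here is a minimal sketch of falling back to an empty resource when the client returns null. The names Res, RmClient and availableOrZero are hypothetical stand-ins made up for this example; they are not the real YARN or Tez classes, and the actual PR may handle the null differently.

```java
// Hypothetical stand-ins for illustration only; not the real YARN/Tez types.
class NullGuardSketch {
    /** Minimal stand-in for a resource record (memory in MB, virtual cores). */
    static final class Res {
        final long memoryMb;
        final int vcores;

        Res(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    /** Stand-in for a client whose getAvailableResources() may return null. */
    interface RmClient {
        Res getAvailableResources();
    }

    static final Res ZERO = new Res(0, 0);

    /** Returns the reported headroom, or an empty resource if none was reported. */
    static Res availableOrZero(RmClient client) {
        Res reported = client.getAvailableResources();
        return reported != null ? reported : ZERO;
    }

    public static void main(String[] args) {
        // Simulates a cluster where the response carries no available resources.
        RmClient federated = () -> null;
        Res r = availableOrZero(federated);
        System.out.println(r.memoryMb + " MB, " + r.vcores + " vcores");
    }
}
```

With a guard like this, a caller that only reads the headroom sees zero resources instead of failing with an NPE.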

(2) TEZ-4441

The event handler never receives DAGAppMasterEventSchedulingServiceError, so TezAppMaster cannot shut down and gets stuck.
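
The hang can be reproduced in miniature with a simplified model: a callback thread fails, and the rest of the application only learns about it if the failure is forwarded. Everything in this sketch (CallbackErrorSketch, startCallbackThread, shutsDownAfterFailure) is hypothetical and only models the behavior described above; it is not the Tez or YARN code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Simplified model of the hang; all names are hypothetical, not Tez/YARN APIs.
class CallbackErrorSketch {
    /** Runs one callback on its own thread; any failure is handed to onError. */
    static Thread startCallbackThread(Runnable callback, Consumer<Throwable> onError) {
        Thread t = new Thread(() -> {
            try {
                callback.run();
            } catch (Throwable e) {
                onError.accept(e);
            }
        }, "callback-handler");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Returns true if a shutdown was requested within the timeout after the callback failed. */
    static boolean shutsDownAfterFailure(boolean forwardError) {
        CountDownLatch shutdownRequested = new CountDownLatch(1);
        Runnable failingCallback = () -> {
            throw new NullPointerException("available resources was null");
        };
        // When the error is dropped, nothing ever asks the application to stop.
        Consumer<Throwable> onError = forwardError
                ? e -> shutdownRequested.countDown()
                : e -> { };
        startCallbackThread(failingCallback, onError);
        try {
            return shutdownRequested.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("error dropped   -> shutdown requested: " + shutsDownAfterFailure(false));
        System.out.println("error forwarded -> shutdown requested: " + shutsDownAfterFailure(true));
    }
}
```

In the dropped case the waiter times out, which corresponds to the AppMaster staying alive with no working callback thread; in the forwarded case the shutdown request arrives right away.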

 

Could you please review the two PRs:

https://github.com/apache/tez/pull/235

https://github.com/apache/tez/pull/236

 

> TezAppMaster may stuck because of reportError skip send error event. 
> ---------------------------------------------------------------------
>
>                 Key: TEZ-4441
>                 URL: https://issues.apache.org/jira/browse/TEZ-4441
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: zhengchenyu
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In YARN mode, after parseAllPlugins, the className of NamedEntityDescriptor is null.
> When an exception is thrown (for example, the one described in TEZ-4441), the event handler never receives the DAGAppMasterEventSchedulingServiceError event, and the AppMaster gets stuck.


