Posted to issues@tez.apache.org by "zhengchenyu (Jira)" <ji...@apache.org> on 2022/08/09 08:07:00 UTC
[jira] [Comment Edited] (TEZ-4441) TezAppMaster may stuck because of reportError skip send error event.
[ https://issues.apache.org/jira/browse/TEZ-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577244#comment-17577244 ]
zhengchenyu edited comment on TEZ-4441 at 8/9/22 8:06 AM:
----------------------------------------------------------
[~abstractdog] Hi, my Tez application got stuck in a YARN federation cluster.
In our cluster, getAvailableResources returned null, which caused an NPE. AMRMClientAsyncImpl.CallbackHandlerThread then stopped, but TezAppMaster did not shut down. There are two problems here, so I filed two issues:
(1) TEZ-4440
Before YARN-8933, getAvailableResources may return null, which leads to an NPE.
(2) TEZ-4441
The event handler never receives DAGAppMasterEventSchedulingServiceError, so TezAppMaster cannot shut down and gets stuck.
Could you please review the two PRs:
[https://github.com/apache/tez/pull/235]
[https://github.com/apache/tez/pull/236]
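
To illustrate problem (1), here is a minimal sketch of how a null return value can kill the calling thread, and how a null guard avoids it. It does not use the real YARN or Tez classes: HeadroomSketch, Res, unguardedHeadroom and guardedHeadroom are hypothetical names made up for this example, and treating a missing value as zero headroom is an assumption, not necessarily what the PR does.

```java
import java.util.function.Supplier;

public class HeadroomSketch {
    /** Hypothetical stand-in for a resource record; this is NOT the YARN Resource class. */
    public static final class Res {
        final long memoryMb;
        public Res(long memoryMb) { this.memoryMb = memoryMb; }
    }

    /** Unguarded read: throws NullPointerException when the supplier yields null. */
    public static long unguardedHeadroom(Supplier<Res> client) {
        return client.get().memoryMb;
    }

    /** Guarded read: treats a missing value as zero headroom (an assumption) instead of failing. */
    public static long guardedHeadroom(Supplier<Res> client) {
        Res r = client.get();
        return r == null ? 0L : r.memoryMb;
    }

    public static void main(String[] args) {
        // Simulates a client that has no resource information yet.
        Supplier<Res> nullClient = () -> null;
        System.out.println("guarded headroom: " + guardedHeadroom(nullClient));
        try {
            unguardedHeadroom(nullClient);
        } catch (NullPointerException e) {
            // In a callback thread without such a catch, the thread would simply end here.
            System.out.println("unguarded read threw NPE");
        }
    }
}
```

If the unguarded variant runs inside a callback thread that nobody supervises, the thread ends while the rest of the process keeps running, which matches the symptom described above.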
> TezAppMaster may stuck because of reportError skip send error event.
> ---------------------------------------------------------------------
>
> Key: TEZ-4441
> URL: https://issues.apache.org/jira/browse/TEZ-4441
> Project: Apache Tez
> Issue Type: Bug
> Reporter: zhengchenyu
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In YARN mode, after parseAllPlugins, the className of NamedEntityDescriptor is null.
> When an exception is thrown (for example, the one described in TEZ-4441), the event handler never receives the DAGAppMasterEventSchedulingServiceError event, and the AppMaster gets stuck.
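
The failure mode in the description can be sketched roughly as follows. The actual condition inside Tez's reportError is not shown in this message, so the null check below is only an assumption about how an error event could be skipped. ReportErrorSketch, reportSkippingUnknown and reportAlways are hypothetical names, and the events list stands in for the real event handler.

```java
import java.util.ArrayList;
import java.util.List;

public class ReportErrorSketch {
    /** Stand-in for the event handler: records every error event that is dispatched. */
    public final List<String> events = new ArrayList<>();

    /**
     * Assumed buggy variant: when the plugin class name is null, the error is
     * dropped silently, so nothing downstream ever learns about the failure.
     */
    public boolean reportSkippingUnknown(String className, Throwable t) {
        if (className == null) {
            return false; // no event sent; the consumer keeps waiting
        }
        events.add("SchedulingServiceError from " + className + ": " + t.getMessage());
        return true;
    }

    /** Defensive variant: always dispatches, using a placeholder for a missing name. */
    public boolean reportAlways(String className, Throwable t) {
        String source = (className == null) ? "<unknown plugin>" : className;
        events.add("SchedulingServiceError from " + source + ": " + t.getMessage());
        return true;
    }

    public static void main(String[] args) {
        ReportErrorSketch sketch = new ReportErrorSketch();
        Throwable failure = new RuntimeException("scheduler callback failed");
        System.out.println("skipping variant sent event: "
                + sketch.reportSkippingUnknown(null, failure));
        System.out.println("defensive variant sent event: "
                + sketch.reportAlways(null, failure));
        System.out.println("events recorded: " + sketch.events.size());
    }
}
```

With the skipping variant, a component that shuts down only on receiving the error event would wait forever, which is the kind of hang reported here.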
--
This message was sent by Atlassian Jira
(v8.20.10#820010)