Posted to issues@tez.apache.org by "Yingda Chen (JIRA)" <ji...@apache.org> on 2019/04/26 23:19:00 UTC
[jira] [Assigned] (TEZ-4063) DAGClient:tryKillDAG taking long time

     [ https://issues.apache.org/jira/browse/TEZ-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yingda Chen reassigned TEZ-4063:
--------------------------------

    Assignee: Ying Han

> DAGClient:tryKillDAG taking long time
> -------------------------------------
>
>                 Key: TEZ-4063
>                 URL: https://issues.apache.org/jira/browse/TEZ-4063
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Ganesha Shreedhara
>            Assignee: Ying Han
>            Priority: Major
>
> Hive uses DAGClient:tryKillDAG() to kill a Tez application. The kill takes a long time when many tasks are being processed, because the kill event is appended to the eventQueue and is handled only after all events already queued ahead of it.
> I have a job with ~300K (3 lakh) mappers, ~5K reducers and ~1000 tasks running in parallel.
> When the Hive query is killed while this job is running, it takes ~6 minutes before tasks start getting killed: ~3 minutes for the kill event from the AM to reach the DAG, and another ~3 minutes for the kill event from the DAG to reach the vertices.
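To make the queuing effect concrete, here is a minimal sketch, assuming a single thread draining one FIFO queue (as the `Dispatcher thread {Central}` log lines suggest). It is not Tez's dispatcher code: `FifoKillDelay`, `Type`, and `simulate` are hypothetical names. It also assumes, for illustration only, that each queued task event produces one follow-up event, so new work keeps arriving while the kill waits. Handling the DAG kill enqueues a vertex kill at the tail, which models the second hop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model (not Tez source): one thread drains one FIFO event queue.
class FifoKillDelay {

    // Illustrative event types for this sketch only.
    enum Type { TASK_EVENT, TASK_FOLLOW_UP, DAG_KILL, VERTEX_KILL }

    // Returns the order in which events are handled when `backlog` task events
    // are already queued ahead of the kill request.
    static List<Type> simulate(int backlog) {
        Deque<Type> queue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            queue.addLast(Type.TASK_EVENT);
        }
        queue.addLast(Type.DAG_KILL); // the kill request joins the tail of the queue

        List<Type> handled = new ArrayList<>();
        while (!queue.isEmpty()) {
            Type event = queue.pollFirst();
            handled.add(event);
            if (event == Type.TASK_EVENT) {
                // Assumption: each task event yields one follow-up event,
                // so new work keeps arriving while the kill waits.
                queue.addLast(Type.TASK_FOLLOW_UP);
            } else if (event == Type.DAG_KILL) {
                // Second hop: the DAG forwards the kill to its vertices by
                // enqueuing another event, again at the tail.
                queue.addLast(Type.VERTEX_KILL);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Type> order = simulate(5);
        System.out.println("DAG_KILL handled at index " + order.indexOf(Type.DAG_KILL));
        System.out.println("VERTEX_KILL handled at index " + order.indexOf(Type.VERTEX_KILL));
    }
}
```

With a backlog of 5 the DAG kill is handled at index 5 and the vertex kill at index 11: in this model each hop waits behind roughly the whole backlog, which is consistent with the two ~3-minute waits reported above.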
>  
> *Relevant log excerpt:*
> {code:java}
> 2019-04-10 15:11:35,776 [INFO] [IPC Server handler 0 on 44129] |app.DAGAppMaster|: Sending a kill event to the current DAG, dagId=dag_1554789825317_0535_1
> 2019-04-10 15:11:35,785 [INFO] [IPC Server handler 0 on 44129] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1554789825317_0535_1][Event:DAG_KILL_REQUEST]: org.apache.tez.dag.history.events.DAGKillRequestEvent@731f79f4
> .
> .
> ~ 3 mins of delay
> .
> .
> 2019-04-10 15:14:34,171 [INFO] [Dispatcher thread {Central}] |impl.DAGImpl|: Dag received [DAG_TERMINATE, DAG_KILL] in RUNNING state
> .
> .
> ~ 3 mins of delay
> .
> .
> 2019-04-10 15:17:52,434 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_01 [Reducer 2] due to trigger: DAG_TERMINATED
> 2019-04-10 15:17:52,439 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_00 [Map 1] due to trigger: DAG_TERMINATED
> {code}
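The two delays can be read directly off the timestamps in the log. A small sketch using `java.time` computes them; `KillLatencyFromLog` and `between` are hypothetical helper names, and the three timestamps are copied from the log lines above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measures the gaps between the kill-related log lines above.
class KillLatencyFromLog {

    // Matches the "2019-04-10 15:11:35,776" timestamp prefix of the log lines.
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration between(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TIME),
                                LocalDateTime.parse(to, LOG_TIME));
    }

    public static void main(String[] args) {
        String killSent = "2019-04-10 15:11:35,776";      // DAGAppMaster sends the kill event
        String dagReceived = "2019-04-10 15:14:34,171";   // DAGImpl receives DAG_KILL
        String vertexKilling = "2019-04-10 15:17:52,434"; // first VertexImpl starts killing tasks

        System.out.println("AM -> DAG:     " + between(killSent, dagReceived));
        System.out.println("DAG -> vertex: " + between(dagReceived, vertexKilling));
        System.out.println("total:         " + between(killSent, vertexKilling));
    }
}
```

This prints `PT2M58.395S`, `PT3M18.263S` and `PT6M16.658S`: about 178 s for the AM-to-DAG hop and about 198 s for the DAG-to-vertex hop, roughly 6 min 17 s in total.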
>  
> Pig, by contrast, uses the TezClient:stop() method, which kills the application asynchronously. It also uses the tez.client.timeout-ms configuration, which can be set so that the YARN application is killed when the client timeout exceeds a threshold value.
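Pig's approach, as described above, pairs an asynchronous stop with a client-side timeout. The general pattern (try a graceful kill, fall back to a hard kill once a timeout expires) can be sketched with `java.util.concurrent`. This is an illustration under assumptions, not Pig's or Tez's implementation: `KillWithTimeout`, `killWithFallback`, and the two `Runnable`s are hypothetical stand-ins for, say, a DAG-level kill request and a YARN-level application kill, and `timeoutMs` only plays the role that tez.client.timeout-ms is described as playing.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical pattern, not Pig or Tez code: attempt a graceful kill on a
// background thread and fall back to a hard kill if it does not finish in time.
class KillWithTimeout {

    // Returns true if the graceful kill completed within timeoutMs,
    // false if the hard-kill fallback was run instead.
    static boolean killWithFallback(Runnable gracefulKill, Runnable hardKill, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> pending = executor.submit(gracefulKill);
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Too slow (or failed): stop waiting and kill hard.
            pending.cancel(true);
            hardKill.run();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            pending.cancel(true);
            hardKill.run();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicBoolean hardKilled = new AtomicBoolean(false);
        // Simulates a graceful kill stuck behind a long event backlog.
        Runnable slowGraceful = () -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        boolean graceful = killWithFallback(slowGraceful, () -> hardKilled.set(true), 100);
        System.out.println("graceful=" + graceful + ", hardKilled=" + hardKilled.get());
    }
}
```

Running `main` prints `graceful=false, hardKilled=true`: the simulated graceful kill is still waiting when the 100 ms budget runs out, so the fallback fires. The design bounds how long the client waits, but it does not make the graceful path itself any faster.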
>  
> Is it expected behaviour that DAGClient:tryKillDAG() adds the kill event to the eventQueue, so that it is processed only after every event queued before it?
> Could the kill event be processed immediately (perhaps only when a configuration is enabled) for users who do not need the pending events to be processed first?
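The proposal in the last question can be sketched as a dispatcher with two lanes: when a flag is enabled, kill events go to an urgent lane that is always drained before the normal backlog. Everything here is hypothetical: `KillFirstDispatcher`, the `killFirst` flag, and the string event names are illustrative and do not correspond to an existing Tez class or configuration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposal, not Tez code: with killFirst enabled,
// kill events bypass the backlog via a separate lane that is drained first.
class KillFirstDispatcher {
    private static final Set<String> KILL_EVENTS = Set.of("DAG_KILL", "VERTEX_KILL");

    private final Deque<String> urgent = new ArrayDeque<>();
    private final Deque<String> normal = new ArrayDeque<>();
    private final boolean killFirst;

    KillFirstDispatcher(boolean killFirst) {
        this.killFirst = killFirst;
    }

    void enqueue(String event) {
        if (killFirst && KILL_EVENTS.contains(event)) {
            urgent.addLast(event);
        } else {
            normal.addLast(event);
        }
    }

    // Handles every queued event and returns the order in which they ran.
    List<String> drain() {
        List<String> handled = new ArrayList<>();
        while (!urgent.isEmpty() || !normal.isEmpty()) {
            String event = urgent.isEmpty() ? normal.pollFirst() : urgent.pollFirst();
            handled.add(event);
            // Mirrors the two hops in the issue: handling the DAG kill emits a vertex kill.
            if (event.equals("DAG_KILL")) {
                enqueue("VERTEX_KILL");
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        for (boolean flag : new boolean[] {false, true}) {
            KillFirstDispatcher dispatcher = new KillFirstDispatcher(flag);
            for (int i = 0; i < 3; i++) {
                dispatcher.enqueue("TASK_EVENT_" + i);
            }
            dispatcher.enqueue("DAG_KILL");
            System.out.println("killFirst=" + flag + " -> " + dispatcher.drain());
        }
    }
}
```

With `killFirst=false` the order is `[TASK_EVENT_0, TASK_EVENT_1, TASK_EVENT_2, DAG_KILL, VERTEX_KILL]`; with `killFirst=true` it is `[DAG_KILL, VERTEX_KILL, TASK_EVENT_0, TASK_EVENT_1, TASK_EVENT_2]`. One trade-off to weigh is that kill handlers then run against state that has not yet absorbed the pending events, so they would need to tolerate that.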
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)