Posted to issues@tez.apache.org by "Ganesha Shreedhara (JIRA)" <ji...@apache.org> on 2019/04/19 06:54:00 UTC
[jira] [Created] (TEZ-4063) DAGClient:tryKillDAG taking long time

Ganesha Shreedhara created TEZ-4063:
---------------------------------------

             Summary: DAGClient:tryKillDAG taking long time
                 Key: TEZ-4063
                 URL: https://issues.apache.org/jira/browse/TEZ-4063
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Ganesha Shreedhara


Hive uses DAGClient:tryKillDAG() to kill a Tez application. The kill takes a long time when many tasks are being processed, because the kill event is appended to the eventQueue and is handled only after all the events queued ahead of it.
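
The queueing behaviour described above can be illustrated with a minimal sketch. This is not Tez's dispatcher code: the class name, the event strings, and the single in-memory queue are assumptions made purely for illustration. It shows only that, in a FIFO queue, a kill event is reached after every event that was enqueued before it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration only; not the Tez dispatcher implementation.
final class FifoKillDelaySketch {

    /** Returns how many events a FIFO consumer handles before it reaches the kill event. */
    static int eventsHandledBeforeKill(int backlogSize) {
        Queue<String> eventQueue = new ArrayDeque<>();
        // Backlog of ordinary events that were queued before the kill request arrived.
        for (int i = 0; i < backlogSize; i++) {
            eventQueue.add("TASK_EVENT");
        }
        // The kill request is appended at the tail, like any other event.
        eventQueue.add("DAG_KILL");

        int handled = 0;
        // Drain in FIFO order until the kill event is taken from the head.
        while (!"DAG_KILL".equals(eventQueue.poll())) {
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(eventsHandledBeforeKill(100_000));
    }
}
```

With a backlog of N events, the consumer handles all N before it sees the kill, so the delay grows with the backlog size.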

I have a job with ~300K (3 lakh) mappers, ~5K reducers, and ~1000 tasks running in parallel.

When a Hive query is killed while this job is running, it takes ~6 minutes for the tasks to start getting killed: ~3 minutes for the kill event to travel from the AM to the DAG, and another ~3 minutes for it to travel from the DAG to the vertex.

 

The relevant log lines are below:
2019-04-10 15:11:35,776 [INFO] [IPC Server handler 0 on 44129] |app.DAGAppMaster|: Sending a kill event to the current DAG, dagId=dag_1554789825317_0535_1
2019-04-10 15:11:35,785 [INFO] [IPC Server handler 0 on 44129] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1554789825317_0535_1][Event:DAG_KILL_REQUEST]: org.apache.tez.dag.history.events.DAGKillRequestEvent@731f79f4
.
.
~ 3 mins of delay
.
.
2019-04-10 15:14:34,171 [INFO] [Dispatcher thread {Central}] |impl.DAGImpl|: Dag received [DAG_TERMINATE, DAG_KILL] in RUNNING state
.
.
~ 3 mins of delay
.
.
2019-04-10 15:17:52,434 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_01 [Reducer 2] due to trigger: DAG_TERMINATED
2019-04-10 15:17:52,439 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_00 [Map 1] due to trigger: DAG_TERMINATED
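
The two delays can be computed directly from the timestamps in the log lines above. The following is a small sketch that uses only the standard java.time API; the class and method names are made up for this example, and the timestamp pattern is assumed from the format seen in the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the delays off the log timestamps.
final class KillDelayFromLog {

    // Pattern assumed from the log lines, e.g. "2019-04-10 15:11:35,776".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the elapsed milliseconds between two log timestamps. */
    static long millisBetween(String earlier, String later) {
        LocalDateTime start = LocalDateTime.parse(earlier, LOG_TIME);
        LocalDateTime end = LocalDateTime.parse(later, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        String killRequested = "2019-04-10 15:11:35,776"; // DAGAppMaster sends kill event
        String dagReceived   = "2019-04-10 15:14:34,171"; // DAGImpl receives DAG_KILL
        String vertexKilling = "2019-04-10 15:17:52,434"; // VertexImpl starts killing tasks

        System.out.println("AM -> DAG (ms):     " + millisBetween(killRequested, dagReceived));
        System.out.println("DAG -> vertex (ms): " + millisBetween(dagReceived, vertexKilling));
        System.out.println("Total (ms):         " + millisBetween(killRequested, vertexKilling));
    }
}
```

For the timestamps above this gives about 178 seconds from the AM to the DAG and about 198 seconds from the DAG to the vertex, roughly 6 minutes 17 seconds in total.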
 

Pig uses the TezClient:stop() method, which kills the application asynchronously. It also uses the tez.client.timeout-ms configuration, which can be set so that the YARN application is killed if the client timeout exceeds a threshold value.

 

Is it expected behaviour for the kill event to be added to the eventQueue and processed synchronously when DAGClient:tryKillDAG() is called?

Can we process the kill event immediately (perhaps when a configuration option is enabled) if the user doesn't need the previously queued events to be processed?
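
To make the question concrete, here is a rough sketch of what "process the kill event immediately" could look like for a simple in-memory queue. It is only an assumption about one possible approach, not a proposal for how Tez's dispatcher is or should be implemented. The class name, the event strings, and the expediteKill flag (standing in for the configuration option mentioned above) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not a patch against Tez.
final class ExpeditedKillSketch {

    /**
     * Returns how many backlog events are handled before the kill event,
     * depending on whether the kill is placed at the head or the tail.
     */
    static int eventsHandledBeforeKill(int backlogSize, boolean expediteKill) {
        Deque<String> eventQueue = new ArrayDeque<>();
        for (int i = 0; i < backlogSize; i++) {
            eventQueue.addLast("TASK_EVENT");
        }
        if (expediteKill) {
            // Assumed opt-in behaviour: the kill jumps ahead of the backlog.
            eventQueue.addFirst("DAG_KILL");
        } else {
            // Behaviour described in this report: the kill waits its turn.
            eventQueue.addLast("DAG_KILL");
        }

        int handled = 0;
        while (!"DAG_KILL".equals(eventQueue.pollFirst())) {
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("FIFO:      " + eventsHandledBeforeKill(100_000, false));
        System.out.println("Expedited: " + eventsHandledBeforeKill(100_000, true));
    }
}
```

In this toy model the expedited kill is handled before any backlog event. Whether skipping ahead of already queued events is safe in the real state machines is exactly the open question.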

 

 


