You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/09/10 06:07:28 UTC

[jira] [Updated] (TEZ-1562) DAGImpl commitOrAbortOutputs takes long time (300+ seconds) for reducer vertex with 4000+ tasks

     [ https://issues.apache.org/jira/browse/TEZ-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1562:
----------------------------------
    Description: 
M7 --> R1
M2 --> R1

I was running a job with 4000 reducers.  This is in non-session mode.

At the end of the job, it took 300+ seconds just to commit().  

It appears that its doing some cleanup work. But in session mode, this could lead to the following issue

1. Open session
2. Run small job1
3. Run large job1
4. Run small job2

In this case, small job2 wouldn't even start running until large-job1's commit is over which would be in the order of 300+ seconds.  

Please refer to the logs here, 

2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1409722953518_0143_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=3
2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Calling DAG commit/abort for dag: dag_1409722953518_0143_1
2014-09-09 17:58:20,426 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: Reducer_6_sink for vertex: vertex_1409722953518_0143_1_02
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_7
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_2
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to commitOrAbortOutputs : 304784  <==== In milliseconds
2014-09-09 18:03:25,361 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to log jobhistory : 304938  <==== Cumulative number
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to handle DAGAppMasterEventDAGFinished : 304940 <==== Cumulative number
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: DAG: dag_1409722953518_0143_1 finished with state: SUCCEEDED
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409722953518_0143_1 transitioned from RUNNING to SUCCEEDED

Should this operation be moved to a separate thread as the DAG is already marked as SUCCEEDED?

  was:
M7===\
	 R1
	 /
M2===

I was running a job with 4000 reducers.  This is in non-session mode.

At the end of the job, it took 300+ seconds just to commit().  

It appears that its doing some cleanup work. But in session mode, this could lead to the following issue

1. Open session
2. Run small job1
3. Run large job1
4. Run small job2

In this case, small job2 wouldn't even start running until large-job1's commit is over which would be in the order of 300+ seconds.  

Please refer to the logs here, 

2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1409722953518_0143_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=3
2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Calling DAG commit/abort for dag: dag_1409722953518_0143_1
2014-09-09 17:58:20,426 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: Reducer_6_sink for vertex: vertex_1409722953518_0143_1_02
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_7
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_2
2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to commitOrAbortOutputs : 304784  <==== In milliseconds
2014-09-09 18:03:25,361 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to log jobhistory : 304938  <==== Cumulative number
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to handle DAGAppMasterEventDAGFinished : 304940 <==== Cumulative number
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: DAG: dag_1409722953518_0143_1 finished with state: SUCCEEDED
2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409722953518_0143_1 transitioned from RUNNING to SUCCEEDED

Should this operation be moved to a separate thread as the DAG is already marked as SUCCEEDED?


> DAGImpl commitOrAbortOutputs takes long time (300+ seconds) for reducer vertex with 4000+ tasks
> -----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1562
>                 URL: https://issues.apache.org/jira/browse/TEZ-1562
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>              Labels: performance
>
> M7 --> R1
> M2 --> R1
> I was running a job with 4000 reducers.  This is in non-session mode.
> At the end of the job, it took 300+ seconds just to commit().  
> It appears that its doing some cleanup work. But in session mode, this could lead to the following issue
> 1. Open session
> 2. Run small job1
> 3. Run large job1
> 4. Run small job2
> In this case, small job2 wouldn't even start running until large-job1's commit is over which would be in the order of 300+ seconds.  
> Please refer to the logs here, 
> 2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Vertex vertex_1409722953518_0143_1_02 completed., numCompletedVertices=3, numSuccessfulVertices=3, numFailedVertices=0, numKilledVertices=0, numVertices=3
> 2014-09-09 17:58:20,422 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Calling DAG commit/abort for dag: dag_1409722953518_0143_1
> 2014-09-09 17:58:20,426 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: Reducer_6_sink for vertex: vertex_1409722953518_0143_1_02
> 2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_7
> 2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: No output committers for vertex: Map_2
> 2014-09-09 18:03:25,207 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to commitOrAbortOutputs : 304784  <==== In milliseconds
> 2014-09-09 18:03:25,361 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to log jobhistory : 304938  <==== Cumulative number
> 2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Patch : Time taken to handle DAGAppMasterEventDAGFinished : 304940 <==== Cumulative number
> 2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: DAG: dag_1409722953518_0143_1 finished with state: SUCCEEDED
> 2014-09-09 18:03:25,363 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409722953518_0143_1 transitioned from RUNNING to SUCCEEDED
> Should this operation be moved to a separate thread as the DAG is already marked as SUCCEEDED?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)