Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2015/04/02 01:01:57 UTC

[jira] [Commented] (TEZ-2262) Tez : Catch counters.LimitExceededException and don't fail the DAG

    [ https://issues.apache.org/jira/browse/TEZ-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391686#comment-14391686 ] 

Hitesh Shah commented on TEZ-2262:
----------------------------------

I think there are two issues here. First, Tez does not catch the counter-limit-exceeded exception in the AM, so the AM crashes when the limit is crossed. Second, we should probably not fail the DAG when counter limits are hit (maybe add diagnostics or something similar to indicate that the counters are invalid).

It looks like TEZ-2263 was also filed. I will leave this jira for the second issue raised above and change TEZ-2263 to address the first issue. [~mmokhtar], does that make sense?
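
For illustration only, here is a rough sketch of the second issue: catching the limit exception while aggregating counters and recording a diagnostic instead of letting it escape. This is not Tez code. `BoundedCounters`, `CounterAggregationSketch`, `aggregate`, and the local `LimitExceededException` are hypothetical stand-ins (the real exception is org.apache.tez.common.counters.LimitExceededException), and the diagnostic wording is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for org.apache.tez.common.counters.LimitExceededException.
class LimitExceededException extends RuntimeException {
    LimitExceededException(String msg) {
        super(msg);
    }
}

// Hypothetical, simplified counter store with a hard cap on distinct counters.
class BoundedCounters {
    private final int max;
    private final Map<String, Long> values = new LinkedHashMap<>();

    BoundedCounters(int max) {
        this.max = max;
    }

    // Adds delta to the named counter; throws if a new counter would exceed the cap.
    void increment(String name, long delta) {
        if (!values.containsKey(name) && values.size() >= max) {
            throw new LimitExceededException(
                "Too many counters: " + (values.size() + 1) + " max=" + max);
        }
        values.merge(name, delta, Long::sum);
    }

    int size() {
        return values.size();
    }

    long get(String name) {
        return values.getOrDefault(name, 0L);
    }
}

class CounterAggregationSketch {
    // Merges per-task counters into 'total'. If the limit is hit, merging stops,
    // a diagnostic is recorded, and false is returned; the exception is not rethrown.
    static boolean aggregate(BoundedCounters total,
                             List<Map<String, Long>> taskCounters,
                             List<String> diagnostics) {
        try {
            for (Map<String, Long> task : taskCounters) {
                for (Map.Entry<String, Long> e : task.entrySet()) {
                    total.increment(e.getKey(), e.getValue());
                }
            }
            return true;
        } catch (LimitExceededException ex) {
            // Assumed wording; the point is to flag the counters as incomplete.
            diagnostics.add("Counters are incomplete: " + ex.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        BoundedCounters total = new BoundedCounters(2);
        List<String> diagnostics = new ArrayList<>();
        List<Map<String, Long>> tasks = Arrays.asList(
            Collections.singletonMap("A", 1L),
            Collections.singletonMap("B", 2L),
            Collections.singletonMap("C", 3L));
        boolean complete = aggregate(total, tasks, diagnostics);
        System.out.println("complete=" + complete + " diagnostics=" + diagnostics);
    }
}
```

In this sketch the caller decides what to do with the `false` return value; the DAG-level code could still report success and attach the diagnostics.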



> Tez : Catch counters.LimitExceededException and don't fail the DAG
> ------------------------------------------------------------------
>
>                 Key: TEZ-2262
>                 URL: https://issues.apache.org/jira/browse/TEZ-2262
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Mostafa Mokhtar
>
> Running TPC-DS Q64 failed due to exceeding the max number of counters.
> DAG should succeed and include a warning in the diagnostics stating that the error got truncated.
> {code}
> 2015-04-01 16:23:08,509 INFO [AsyncDispatcher event handler] impl.DAGImpl: No output committers for vertex: Reducer 9
> 2015-04-01 16:23:08,857 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher: Error in dispatcher thread
> org.apache.tez.common.counters.LimitExceededException: Too many counters: 1201 max=1200
> 	at org.apache.tez.common.counters.Limits.checkCounters(Limits.java:87)
> 	at org.apache.tez.common.counters.Limits.incrCounters(Limits.java:94)
> 	at org.apache.tez.common.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:75)
> 	at org.apache.tez.common.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:92)
> 	at org.apache.tez.common.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:103)
> 	at org.apache.tez.common.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:198)
> 	at org.apache.tez.common.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:363)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.incrTaskCounters(DAGImpl.java:598)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.getAllCounters(DAGImpl.java:588)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryFinishedEvent(DAGImpl.java:994)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1135)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.checkDAGForCompletion(DAGImpl.java:1048)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl$VertexCompletedTransition.transition(DAGImpl.java:1708)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl$VertexCompletedTransition.transition(DAGImpl.java:1665)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:944)
> 	at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:126)
> 	at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1686)
> 	at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1677)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-04-01 16:23:08,882 INFO [AsyncDispatcher event handler] event.AsyncDispatcher: Exiting, bbye..
> 2015-04-01 16:23:08,885 INFO [Thread-1] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
> 2015-04-01 16:23:08,885 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
> 2015-04-01 16:23:08,885 INFO [Thread-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)