You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2021/01/29 16:29:00 UTC

[jira] [Comment Edited] (TEZ-4277) AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state

    [ https://issues.apache.org/jira/browse/TEZ-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274553#comment-17274553 ] 

László Bodor edited comment on TEZ-4277 at 1/29/21, 4:28 PM:
-------------------------------------------------------------

https://github.com/apache/tez/pull/97
cc: [~ashutoshc]


was (Author: abstractdog):
https://github.com/apache/tez/pull/97

> AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-4277
>                 URL: https://issues.apache.org/jira/browse/TEZ-4277
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.1
>
>
> While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher (however they have a common codebase for serviceStop at the moment), I'm about to put a safety check, which solved my issue.
> jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
> {code}
> "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait() [0x000070000997b000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1252)
> - locked <0x00000007ae967340> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1326)
> at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae642cc8> (a java.lang.Object)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
> at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
> - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae53d5f0> (a java.lang.Object)
> at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
> at org.apache.tez.client.TezClient.stop(TezClient.java:775)
> - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
> at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> during the investigation I added some log messages, that lead to this solution:
> 1. there is not a single event in the queue
> 2. event handling thread is in BLOCKED state
> 3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to be stopped
> {code}
> 2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events.
> ...
> 2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> ...
> 2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002)) - Calling stop for all the services
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, logging current events if any...
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, state: BLOCKED
> {code}
> under normal circumstances the event handling thread is in WAITING state (which I think is because of queue.take)
> + I also included the timeout logic of YARN-3999 here



--
This message was sent by Atlassian Jira
(v8.3.4#803005)