You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Ayush Saxena (Jira)" <ji...@apache.org> on 2024/01/25 08:08:00 UTC

[jira] [Resolved] (TEZ-4501) Fix TestLocalMode timeouts - handle drained in case of InterruptedException

     [ https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena resolved TEZ-4501.
-------------------------------
    Fix Version/s: 0.10.4
       Resolution: Fixed

> Fix TestLocalMode timeouts - handle drained in case of InterruptedException
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-4501
>                 URL: https://issues.apache.org/jira/browse/TEZ-4501
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.4
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> The relevant part from AsyncDispatcher is:
> {code}
>       synchronized (waitForDrained) {
>         while (!drained && eventHandlingThread.isAlive() && System.currentTimeMillis() < endTime) {
>           waitForDrained.wait(1000);
>           LOG.info(
>               "Waiting for AsyncDispatcher to drain. Current queue size: {}, handler thread state: {}",
>               eventQueue.size(), eventHandlingThread.getState());
>         }
>       }
> {code}
> Here is a log snippet from TestLocalMode when looping for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because drained was false.
> {code}
> 2023-07-03 12:15:39,938 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events.
> 2023-07-03 12:15:39,938 INFO  [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType: C_COMPLETED
> 2023-07-03 12:15:39,938 WARN  [CallbackExecutor] common.AsyncDispatcher (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while putting event): EventType: C_COMPLETED
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> 	at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
> 	at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
> 	at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
> 	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Exception in thread "CallbackExecutor" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
> 	at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
> 	at org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
> 	at org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
> 	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> 	at org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
> 	... 6 more
> 2023-07-03 12:15:40,938 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:41,939 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:42,939 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:43,939 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:44,940 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:45,940 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:46,940 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:47,941 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:48,941 INFO  [Time-limited test] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:49,149 INFO  [Time-limited test] service.AbstractService (AbstractService.java:noteFailure(267)) - Service Central failed in state STOPPED
> java.lang.InterruptedException
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
> {code}
> what happened here is:
> 1. test is stopped (tezclient.stop)
> 2. eventQueue.put has been [interrupted|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L367], but drained is already [set to false|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> 3. eventQueue.size == 0 and [still looping|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L151], this is only possible if [drained is false: it was set in GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> 4. hence test waits for TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT (10000ms default), but test timeout is also 10000ms: instead of increasing test timeout it would make sense to take care of drained variable in the event of an interrupt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)