You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by "Andras Beni (JIRA)" <ji...@apache.org> on 2017/02/22 20:15:44 UTC

[jira] [Comment Edited] (FLUME-3056) TestApplication hangs indefinitely

    [ https://issues.apache.org/jira/browse/FLUME-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879129#comment-15879129 ] 

Andras Beni edited comment on FLUME-3056 at 2/22/17 8:14 PM:
-------------------------------------------------------------

The issue was discovered while running tests with maven. JUnit ran test class TestApplication for exactly 15 minutes which is the default timeout for forked processes, and did not report on test results. 
I was able to reproduce the problem (it is to be observed once every few hundred runs) and made the thread dump of which a part is visible in this issues description. The test method TestApplication.testFlume1854() starts an application and stops it almost immediately. The application started contains a  PollingPropertiesFileConfigurationProvider. This configuration provider starts a separate thread in its start() method, that polls the configuration file every 30 seconds. When it find out, that the file has changed (and on first run), it notifies the application through EventBus, that it has to reconfigure. 
In the case of the test, the application is started, which starts the configuration provider and its poller thread. By the time the thread tries to send the configuration event to the Application object, it is already stopping, so (method stop() being synchronized ) it locks on itself. But the poller thread is also trying to lock on the application by calling handleConfigurationEvent(...). As Application.stop() waits for the poller thread to terminate and the poller thread waits for Application.stop() to release the lock, we get in a deadlock.
In normal usage it is possible, that a user edits configuration and stops or restarts the application. If the shutdown hook runs at the same time as the poller notices the change, the application will get into the same deadlock, and the user will not be able to shutdown flume properly.

The same issue cannot happen with Application.start(), because start does not wait for the poller thread, so even if the poller has to wait for start to release the lock, it will not have to wait forever.


was (Author: andrasbeni):
The issue was discovered while running tests with maven. JUnit ran test class TestApplication for exactly 15 minutes which is the default timeout for forked processes, and did not report on test results. 
I was able to reproduce the problem (it is to be observed once every few hundred runs) and made the thread dump of which a part is visible in this issues description. The test method TestApplication#testFlume1854() starts an application and stops it almost immediately. The application started contains a  PollingPropertiesFileConfigurationProvider. This configuration provider starts a separate thread in its start() method, that polls the configuration file every 30 seconds. When it find out, that the file has changed (and on first run), it notifies the application through EventBus, that it has to reconfigure. 
In the case of the test, the application is started, which starts the configuration provider and its poller thread. By the time the thread tries to send the configuration event to the Application object, it is already stopping, so (method stop() being synchronized ) it locks on itself. But the poller thread is also trying to lock on the application by calling handleConfigurationEvent(...). As Application.stop() waits for the poller thread to terminate and the poller thread waits for Application.stop() to release the lock, we get in a deadlock.
In normal usage it is possible, that a user edits configuration and stop or restarts the application. If the shutdown hook runs at the same time as the poller notices the change, the application will get into the same deadlock, and the user will not be able to shutdown flume properly.

> TestApplication hangs indefinitely
> ----------------------------------
>
>                 Key: FLUME-3056
>                 URL: https://issues.apache.org/jira/browse/FLUME-3056
>             Project: Flume
>          Issue Type: Bug
>          Components: Configuration, Test
>    Affects Versions: v1.7.0
>            Reporter: Andras Beni
>            Priority: Minor
>
> Unit test hangs indefinitely when TestApplication.testFLUME1854() becomes blocked in the following, deadlock-like situation:
> Application waits for PollingPropertiesFileConfigurationProvider to stop while PollingPropertiesFileConfigurationProvider tries to notify Application of configuration change.
> {noformat}
> "conf-file-poller-0" #17750 prio=5 os_prio=31 tid=0x00007fdeb7972000 nid=0x638f waiting for monitor entry [0x000070000eb36000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.flume.node.Application.handleConfigurationEvent(Application.java:87)
> 	- waiting to lock <0x000000077a130178> (a org.apache.flume.node.Application)
> 	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at com.google.common.eventbus.EventHandler.handleEvent(EventHandler.java:68)
> 	at com.google.common.eventbus.SynchronizedEventHandler.handleEvent(SynchronizedEventHandler.java:45)
> 	- locked <0x000000077a1301b0> (a com.google.common.eventbus.SynchronizedEventHandler)
> 	at com.google.common.eventbus.EventBus.dispatch(EventBus.java:313)
> 	at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:296)
> 	at com.google.common.eventbus.EventBus.post(EventBus.java:264)
> 	at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
> 	- <0x000000077a130250> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "main" #1 prio=5 os_prio=31 tid=0x00007fdeb9000000 nid=0x1b03 waiting on condition [0x000070000d779000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00000006c253e9f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> 	at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
> 	at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:675)
> 	at org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop(PollingPropertiesFileConfigurationProvider.java:88)
> 	at org.apache.flume.lifecycle.LifecycleSupervisor.stop(LifecycleSupervisor.java:104)
> 	- locked <0x000000077a138158> (a org.apache.flume.lifecycle.LifecycleSupervisor)
> 	at org.apache.flume.node.Application.stop(Application.java:92)
> 	- locked <0x000000077a130178> (a org.apache.flume.node.Application)
> 	at org.apache.flume.node.TestApplication.testFLUME1854(TestApplication.java:155)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)