Posted to dev@flume.apache.org by "Prasad Mujumdar (Updated) (JIRA)" <ji...@apache.org> on 2011/09/29 04:10:45 UTC

[jira] [Updated] (FLUME-768) Agent deadlock possible due to blocked latch in driver thread.

     [ https://issues.apache.org/jira/browse/FLUME-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Mujumdar updated FLUME-768:
----------------------------------

    Attachment: Flume-768.patch

It looks like, in the repro case, the TriggerThread is aborted by some unexpected error. If it is killed for any reason other than an interrupt, it doesn't clear the doneLatch, which leaves the pumper thread waiting forever. The patch simply clears that latch on exit in all cases.
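
For anyone following along without the patch open, here is a minimal sketch of the "clear the latch on exit in all cases" idea. It is not the contents of Flume-768.patch: the class LatchOnExitSketch, its nested TriggerThread stand-in, and the closeCompletes helper are hypothetical names made up for illustration. Only the doneLatch name and the general shape of the problem come from this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchOnExitSketch {

    /** Hypothetical stand-in for the trigger thread; not Flume's actual class. */
    static class TriggerThread extends Thread {
        private final CountDownLatch doneLatch;
        private final boolean failUnexpectedly;

        TriggerThread(CountDownLatch doneLatch, boolean failUnexpectedly) {
            super("trigger-sketch");
            this.doneLatch = doneLatch;
            this.failUnexpectedly = failUnexpectedly;
        }

        @Override
        public void run() {
            try {
                if (failUnexpectedly) {
                    // Simulates the thread dying for a reason other than an interrupt.
                    throw new IllegalStateException("simulated unexpected error");
                }
                // Normal trigger work would go here.
            } finally {
                // Runs on every exit path, so a waiter in close() is always released.
                doneLatch.countDown();
            }
        }
    }

    /** Returns true if the simulated close() was released within the timeout. */
    static boolean closeCompletes(boolean failUnexpectedly) {
        CountDownLatch doneLatch = new CountDownLatch(1);
        TriggerThread trigger = new TriggerThread(doneLatch, failUnexpectedly);
        // Swallow the simulated failure so it does not print a stack trace.
        trigger.setUncaughtExceptionHandler((thread, error) -> { });
        trigger.start();
        try {
            // A bounded wait is used here only so the sketch cannot hang.
            return doneLatch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("normal exit released close: " + closeCompletes(false));
        System.out.println("unexpected error released close: " + closeCompletes(true));
    }
}
```

With the countDown() in the finally block, both calls should report that the waiter was released. If the countDown() only happened on the normal and interrupt paths, the second case would leave the waiter parked, which matches the RollSink.close stack in the dump below.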

                
> Agent deadlock possible due to blocked latch in driver thread.
> --------------------------------------------------------------
>
>                 Key: FLUME-768
>                 URL: https://issues.apache.org/jira/browse/FLUME-768
>             Project: Flume
>          Issue Type: Bug
>          Components: Node
>    Affects Versions: v0.9.4
>            Reporter: Jonathan Hsieh
>            Assignee: Prasad Mujumdar
>             Fix For: v0.9.5
>
>         Attachments: Flume-768.patch
>
>
> There are essentially three blocked threads. Two of the three are blocked because of the third.
> The main problem is that the roll close is blocked waiting for a close to complete. It has a subordinate thread, which seems to be gone, that normally triggers the latch that allows it to close. My guess is that the TriggerThread exited on some exception and, because the latch countdowns aren't present on that path, the ok-to-shutdown latch never got cleared.
> The other two threads are blocked because of this, and likely wouldn't be stuck here if that intermediate thread wasn't stuck.
> The agent's avro source queue is full and it is blocked trying to enqueue more data.
> There is also another thread that is blocked: the wal draining thread is blocked with nothing left to do (why everything is in sent state). This doesn't seem to be part of the problem.
> Thread 21 (448511246@qtp-1388647956-1):
>   State: WAITING
>   Blocked count: 3
>   Waited count: 29
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@11031d18
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:306)
>     com.cloudera.flume.handlers.avro.AvroEventSource.enqueue(AvroEventSource.java:114)
>     com.cloudera.flume.handlers.avro.AvroEventSource$1.append(AvroEventSource.java:135)
>     sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:93)
>     org.apache.avro.ipc.Responder.respond(Responder.java:136)
>     org.apache.avro.ipc.Responder.respond(Responder.java:88)
>     org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>     org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>     org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
>     org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>     org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     org.mortbay.jetty.Server.handle(Server.java:326)
> Here's another thread that is essentially blocked:
> Thread 19 (logicalNode agent-19):
>   State: WAITING
>   Blocked count: 83
>   Waited count: 1143043
>   Waiting on java.util.concurrent.CountDownLatch$Sync@5c328896
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>     java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>     com.cloudera.flume.handlers.rolling.RollSink.close(RollSink.java:213)
>     com.cloudera.flume.agent.durability.NaiveFileWALDeco.close(NaiveFileWALDeco.java:147)
>     com.cloudera.flume.agent.AgentSink.close(AgentSink.java:118)
>     com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
>     com.cloudera.flume.handlers.debug.LazyOpenDecorator.close(LazyOpenDecorator.java:81)
>     com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:121)
> Here's the wal draining thread trying to pull things out of the wal.
> Thread 24 (naive file wal transmit-24):
>   State: TIMED_WAITING
>   Blocked count: 156
>   Waited count: 171352
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
>     java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
>     com.cloudera.flume.agent.durability.NaiveFileWALManager.getUnackedSource(NaiveFileWALManager.java:763)
>     com.cloudera.flume.agent.durability.WALSource.next(WALSource.java:104)
>     com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:91
