Posted to dev@flume.apache.org by "Xuebing Yan (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2012/03/16 12:07:41 UTC

[jira] [Issue Comment Edited] (FLUME-859) Deadlock in Flume Master during execution of command through flume shell

    [ https://issues.apache.org/jira/browse/FLUME-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231057#comment-13231057 ] 

Xuebing Yan edited comment on FLUME-859 at 3/16/12 11:06 AM:
-------------------------------------------------------------

The deadlock is caused by two execution paths locking the same two objects in different orders; the patch changes the lock order in one of the paths. It has worked well for us for two days.
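
A minimal sketch of the kind of fix described in the comment, using hypothetical names (this is not the actual FLUME-859 patch): if every path that needs both the configuration-manager monitor and the status map acquires them in the same order, the lock cycle cannot form.

    // Hypothetical sketch, not Flume code: a heartbeat-style path rewritten so that
    // it takes the same two monitors in the same order as the command-execution path.
    class HeartbeatPath {
        private final Object configManagerLock; // stands in for the LogicalConfigurationManager monitor
        private final Object statusMapLock;     // stands in for the StatusManager's HashMap monitor

        HeartbeatPath(Object configManagerLock, Object statusMapLock) {
            this.configManagerLock = configManagerLock;
            this.statusMapLock = statusMapLock;
        }

        // Before: this path took statusMapLock first and configManagerLock second,
        // the reverse of the command-execution path, allowing a circular wait.
        // After: both paths take configManagerLock first, then statusMapLock.
        void updateHeartbeatStatus() {
            synchronized (configManagerLock) {
                synchronized (statusMapLock) {
                    // ... update per-node heartbeat status here ...
                }
            }
        }
    }

The opposite global order would work just as well; what removes the deadlock is that both paths agree on a single order.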
                
      was (Author: xbing):
    the deadlock is caused by different order to lock two objects, the patch just change the order in one execution path.
                  
> Deadlock in Flume Master during execution of command through flume shell
> ------------------------------------------------------------------------
>
>                 Key: FLUME-859
>                 URL: https://issues.apache.org/jira/browse/FLUME-859
>             Project: Flume
>          Issue Type: Bug
>          Components: Master
>    Affects Versions: v0.9.4
>            Reporter: Aleksei Sudak
>            Priority: Blocker
>             Fix For: v0.9.5
>
>         Attachments: FLUME-859.patch
>
>
> Use case:
> - there are 5 physical boxes running Flume Node attached to a single Flume Master. These 5 are used as agents on the application side to stream logs to the Cloud
> - there are 4 more physical boxes running Flume Node attached to the same Flume Master. These 4 are used as collectors on the Cloud side to write logs to HDFS
> - there are around 200 logical nodes (agents and collectors) configured on these 9 Flume Nodes.
> - during deployment, configuration was executed for all of these 200 logical nodes (mostly sequentially, in some cases 2 in parallel)
> - configuration for each logical node consists of the following steps: unconfig, unmap, decommission, purge, map, config, refreshAll
> During execution, the following deadlock was detected (traces taken with kill -SIGQUIT):
> Java stack information for the threads listed above:
> ===================================================
> "pool-1-thread-1248":
>         at com.cloudera.flume.master.TranslatingConfigurationManager.getLogicalNode(TranslatingConfigurationManager.java:427)
>         - waiting to lock <0x00000003b03724b0> (a com.cloudera.flume.master.logical.LogicalConfigurationManager)
>         at com.cloudera.flume.master.MasterClientServer.getLogicalNodes(MasterClientServer.java:83)
>         at com.cloudera.flume.master.MasterClientServerThrift.getLogicalNodes(MasterClientServerThrift.java:62)
>         at com.cloudera.flume.conf.thrift.ThriftFlumeClientServer$Processor$getLogicalNodes.process(ThriftFlumeClientServer.java:714)
>         at com.cloudera.flume.conf.thrift.ThriftFlumeClientServer$Processor.process(ThriftFlumeClientServer.java:640)
>         at org.apache.thrift.server.TSaneThreadPoolServer$WorkerProcess.run(TSaneThreadPoolServer.java:280)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> "exec-thread":
>         at com.cloudera.flume.master.StatusManager.getStatus(StatusManager.java:213)
>         - waiting to lock <0x00000003b03727e8> (a java.util.HashMap)
>         at com.cloudera.flume.master.logical.LogicalNameManager.updateNode(LogicalNameManager.java:101)
>         - locked <0x00000003b0372ac8> (a com.cloudera.flume.master.logical.LogicalNameManager)
>         at com.cloudera.flume.master.logical.LogicalNameManager.update(LogicalNameManager.java:150)
>         - locked <0x00000003b0372ac8> (a com.cloudera.flume.master.logical.LogicalNameManager)
>         at com.cloudera.flume.master.logical.LogicalConfigurationManager.updateAll(LogicalConfigurationManager.java:236)
>         at com.cloudera.flume.master.TranslatingConfigurationManager.unmapLogicalNode(TranslatingConfigurationManager.java:484)
>         - locked <0x00000003b03724b0> (a com.cloudera.flume.master.logical.LogicalConfigurationManager)
>         at com.cloudera.flume.master.commands.UnmapLogicalNodeForm$1.exec(UnmapLogicalNodeForm.java:71)
>         at com.cloudera.flume.master.CommandManager.exec(CommandManager.java:266)
>         at com.cloudera.flume.master.CommandManager.handleCommand(CommandManager.java:205)
>         at com.cloudera.flume.master.CommandManager$ExecThread.run(CommandManager.java:236)
> "pool-1-thread-37":
>         at com.cloudera.flume.master.TranslatingConfigurationManager.getPhysicalNode(TranslatingConfigurationManager.java:474)
>         - waiting to lock <0x00000003b03724b0> (a com.cloudera.flume.master.logical.LogicalConfigurationManager)
>         at com.cloudera.flume.master.StatusManager.updateHeartbeatStatus(StatusManager.java:97)
>         - locked <0x00000003b03727e8> (a java.util.HashMap)
>         at com.cloudera.flume.master.MasterClientServer.heartbeat(MasterClientServer.java:117)
>         at com.cloudera.flume.master.MasterClientServerThrift.heartbeat(MasterClientServerThrift.java:75)
>         at com.cloudera.flume.conf.thrift.ThriftFlumeClientServer$Processor$heartbeat.process(ThriftFlumeClientServer.java:661)
>         at com.cloudera.flume.conf.thrift.ThriftFlumeClientServer$Processor.process(ThriftFlumeClientServer.java:640)
>         at org.apache.thrift.server.TSaneThreadPoolServer$WorkerProcess.run(TSaneThreadPoolServer.java:280)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Found 1 deadlock.
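
For reference, a self-contained sketch (hypothetical names, not Flume code) that reproduces the deadlock shape shown in the trace: one thread holds the configuration-manager monitor and waits for the status map, while the other holds the status map and waits for the configuration-manager monitor.

    // Standalone demo of the lock-ordering cycle reported above.
    public class LockOrderDeadlockDemo {
        private static final Object configManagerLock = new Object();
        private static final Object statusMapLock = new Object();

        public static void main(String[] args) {
            Thread execThread = new Thread(() -> {
                synchronized (configManagerLock) {      // like unmapLogicalNode
                    pause();
                    synchronized (statusMapLock) {      // like StatusManager.getStatus
                        System.out.println("exec path acquired both locks");
                    }
                }
            }, "exec-thread");

            Thread heartbeatThread = new Thread(() -> {
                synchronized (statusMapLock) {          // like updateHeartbeatStatus
                    pause();
                    synchronized (configManagerLock) {  // like getPhysicalNode
                        System.out.println("heartbeat path acquired both locks");
                    }
                }
            }, "pool-1-thread-37");

            execThread.start();
            heartbeatThread.start();
            // With the pauses in place, both threads nearly always block on each
            // other's monitor; kill -SIGQUIT (or jstack) then reports the same
            // "Found 1 deadlock" cycle as in the trace above.
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }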
