Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/01/30 19:50:11 UTC

[jira] [Resolved] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation

     [ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser resolved ACCUMULO-2269.
----------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 1.5.1)

Ultimately, the only reasonable conclusion I can come to right now is that I was over-saturating the system and got into a bad state because of it. Running one fewer RW client when running with agitation appears to have been successful.

Given that we had nothing else to go on other than ZooKeeper connections appearing to hang, I'll close this for now.

> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-2269
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>         Environment: 1.5.1-SNAPSHOT: 8981ba04
>            Reporter: Josh Elser
>            Priority: Critical
>
> Was running LongClean randomwalk with agitation. Came back to the system with three tables "stuck" in DELETING on the monitor and a generally idle system. Upon investigation, multiple fate txns appear to be deadlocked, in addition to the table deletes.
> {noformat}
> txid: 7ca950aa8de76a17  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dc]         locking: []              top: CleanUp
> txid: 1071086efdbed442  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: LoadFiles
> txid: 32b86cfe06c2ed5d  status: IN_PROGRESS         op: DeleteTable      locked: [W:2d9]         locking: []              top: CleanUp
> txid: 358c065b6cb0516b  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dw]         locking: []              top: CleanUp
> txid: 26b738ee0b044a96  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 16edd31b3723dc5b  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 63c587eb3df6c1b2  status: IN_PROGRESS         op: CompactRange     locked: [R:2cr]         locking: []              top: CompactionDriver
> txid: 722d8e5488531735  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> {noformat}
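> The lock column already tells part of the story: the three DeleteTable txns each hold a write lock on their own table (2dc, 2d9, 2dw), the BulkImport/CompactRange txns share read locks on 2cr, and the "locking:" column is empty, so none of these fate ops is actually queued behind another's table lock. A minimal JDK sketch of that read/write-lock semantics (plain ReentrantReadWriteLock with the table ids from the listing; this is an illustration, not Accumulo's distributed lock code):
> {noformat}
> // Illustration only: per-table read/write locks as implied by the fate listing above.
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> public class TableLockSketch {
>   private static final Map<String, ReentrantReadWriteLock> LOCKS = new ConcurrentHashMap<>();
>
>   private static ReentrantReadWriteLock lockFor(String tableId) {
>     return LOCKS.computeIfAbsent(tableId, id -> new ReentrantReadWriteLock());
>   }
>
>   public static void main(String[] args) {
>     // DeleteTable on 2dc, 2d9, 2dw: each takes W on a different table, so no contention.
>     for (String tableId : new String[] {"2dc", "2d9", "2dw"}) {
>       System.out.println("DeleteTable W:" + tableId + " acquired=" + lockFor(tableId).writeLock().tryLock());
>     }
>     // BulkImport/CompactRange on 2cr: read locks are shared, so all of them succeed too.
>     for (int i = 0; i < 4; i++) {
>       System.out.println("R:2cr acquired=" + lockFor("2cr").readLock().tryLock());
>     }
>     // Everything acquires, mirroring the empty "locking:" column: the hang is not
>     // table-lock contention between these txns, so the threads themselves need a look.
>   }
> }
> {noformat}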
> I started digging into the DeleteTable ops. Each txn still appears to be active and holds the table_lock for its respective table in ZK, but the /tables/<id>/ node and all of its children (state, conf, name, etc.) still exist.
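> For reference, that manual check can be reproduced with a plain ZooKeeper client. A rough sketch, assuming the usual /accumulo/<instance-id>/table_locks/<table-id> and /accumulo/<instance-id>/tables/<table-id> layout (connect string, instance id and table id below are placeholders):
> {noformat}
> // Sketch of the ZK inspection described above; paths and ids are illustrative only.
> import java.util.List;
> import org.apache.zookeeper.ZooKeeper;
>
> public class CheckDeleteTableState {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
>     String base = "/accumulo/" + args[0];  // instance id
>     String tableId = args[1];              // e.g. 2dc
>
>     // The DeleteTable txn should still hold a lock node for its table here...
>     List<String> lockNodes = zk.getChildren(base + "/table_locks/" + tableId, false);
>     System.out.println("table_lock children: " + lockNodes);
>
>     // ...yet the table node and its children (state, conf, name, ...) are still present.
>     List<String> tableNodes = zk.getChildren(base + "/tables/" + tableId, false);
>     System.out.println("/tables/" + tableId + " children: " + tableNodes);
>
>     zk.close();
>   }
> }
> {noformat}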
> Looking at some thread dumps, I have the default (4) repo runner threads. Three of them are blocked on bulk imports:
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on condition [0x00007f25168e7000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000705a05eb8> (a java.util.concurrent.FutureTask)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
>         at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
>         at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
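> That wait is the untimed FutureTask.get() in LoadFiles, so if the underlying load task never completes the repo runner parks forever. A self-contained sketch of the difference between an untimed and a timed get (the never-finishing task below is a stand-in, not the actual bulk-load callable):
> {noformat}
> // Illustration of the WAITING (parking) state above: an untimed Future.get() has
> // no escape hatch, while a timed get() at least makes the hang visible.
> import java.util.concurrent.*;
>
> public class UntimedGetSketch {
>   public static void main(String[] args) throws Exception {
>     ExecutorService pool = Executors.newSingleThreadExecutor();
>     Future<Void> stuck = pool.submit(() -> {
>       Thread.sleep(Long.MAX_VALUE);  // stand-in for a load that never finishes
>       return null;
>     });
>
>     try {
>       stuck.get(10, TimeUnit.SECONDS);  // a timed get surfaces the problem after 10s
>     } catch (TimeoutException e) {
>       System.out.println("still not done after 10s; diagnose or retry here");
>     }
>
>     // stuck.get() with no timeout is exactly what the trace shows: the thread stays
>     // parked on the FutureTask until the task completes, which it never does.
>     pool.shutdownNow();
>   }
> }
> {noformat}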
> The 4th repo runner is stuck trying to reserve a new txn (not sure why it's blocked like this, though):
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in Object.wait() [0x00007f25169e8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
>         - locked <0x00000007014d9928> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
>         at com.sun.proxy.$Proxy11.getData(Unknown Source)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
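> The blocked call is a synchronous ZooKeeper read: ClientCnxn.submitRequest() waits on the request packet until the server answers, so if the session is wedged the reserve never returns. A hedged sketch of bounding such a read with the async getData() callback and a latch (connect string and node path are placeholders, not the real fate txn node):
> {noformat}
> // Sketch only: a ZK read with a client-side bound, in contrast to the synchronous
> // getData() the trace shows parked inside ClientCnxn.submitRequest().
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicReference;
> import org.apache.zookeeper.ZooKeeper;
>
> public class BoundedZkRead {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
>     CountDownLatch done = new CountDownLatch(1);
>     AtomicReference<byte[]> result = new AtomicReference<>();
>
>     // Async read: the callback only fires when the server actually responds.
>     zk.getData("/accumulo/<instance-id>/fate/tx_...", false,
>         (rc, path, ctx, data, stat) -> { result.set(data); done.countDown(); }, null);
>
>     if (!done.await(30, TimeUnit.SECONDS)) {
>       // The synchronous call in the trace has no equivalent timeout; it stays parked
>       // until the session recovers or expires.
>       System.out.println("ZooKeeper read did not complete within 30s");
>     } else {
>       byte[] data = result.get();
>       System.out.println("read " + (data == null ? 0 : data.length) + " bytes");
>     }
>     zk.close();
>   }
> }
> {noformat}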
> There were no obvious errors on the monitor, and the master is still in this state.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)