Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/01/30 19:50:11 UTC
[jira] [Resolved] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation
[ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Elser resolved ACCUMULO-2269.
----------------------------------
Resolution: Cannot Reproduce
Fix Version/s: (was: 1.5.1)
Ultimately, the only reasonable conclusion I can come to right now is that I was over-saturating the system and it got into a bad state because of that. Running one fewer RW client when running with agitation appears to have been successful.
Given that we had nothing else to go on other than ZooKeeper connections appearing to hang, I'll close this for now.
> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
> Key: ACCUMULO-2269
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Environment: 1.5.1-SNAPSHOT: 8981ba04
> Reporter: Josh Elser
> Priority: Critical
>
> Was running LongClean randomwalk with agitation. Came back to the system with three tables "stuck" in DELETING on the monitor and a generally idle system. Upon investigation, multiple fate txns appear to be deadlocked, in addition to the DeleteTable operations.
> {noformat}
> txid: 7ca950aa8de76a17 status: IN_PROGRESS op: DeleteTable locked: [W:2dc] locking: [] top: CleanUp
> txid: 1071086efdbed442 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: LoadFiles
> txid: 32b86cfe06c2ed5d status: IN_PROGRESS op: DeleteTable locked: [W:2d9] locking: [] top: CleanUp
> txid: 358c065b6cb0516b status: IN_PROGRESS op: DeleteTable locked: [W:2dw] locking: [] top: CleanUp
> txid: 26b738ee0b044a96 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> txid: 16edd31b3723dc5b status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> txid: 63c587eb3df6c1b2 status: IN_PROGRESS op: CompactRange locked: [R:2cr] locking: [] top: CompactionDriver
> txid: 722d8e5488531735 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> {noformat}
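[Editor's note: a small sketch of how the lock state above can be tabulated. Note that every txn shows `locked: [...]` with an empty `locking: []`, i.e. every txn already holds its lock and none is queued behind another; the hang is not a classic lock cycle. `FateLockSummary` below is a hypothetical helper (not part of Accumulo) that groups txids from `fate print` output by the table id they hold locked, which makes the pile-up of readers on table 2cr obvious.]

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical helper, not part of Accumulo: summarize "fate print"
// output by grouping txids under the table id whose lock they hold.
public class FateLockSummary {
    // Matches e.g. "txid: 7ca950aa8de76a17 ... locked: [W:2dc] ..."
    static final Pattern LOCKED =
        Pattern.compile("txid: (\\S+).*locked: \\[([RW]):(\\S+)\\]");

    public static Map<String, List<String>> summarize(List<String> lines) {
        Map<String, List<String>> byTable = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOCKED.matcher(line);
            if (m.find()) {
                // Key by table id; record lock type and txid together.
                byTable.computeIfAbsent(m.group(3), k -> new ArrayList<>())
                       .add(m.group(2) + ":" + m.group(1));
            }
        }
        return byTable;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "txid: 7ca950aa8de76a17 status: IN_PROGRESS op: DeleteTable locked: [W:2dc] locking: [] top: CleanUp",
            "txid: 1071086efdbed442 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: LoadFiles",
            "txid: 26b738ee0b044a96 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed");
        System.out.println(summarize(sample));
    }
}
```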
> I started digging into the DeleteTable ops. Each txn still appears to be active and holds the table_lock for its respective table in ZK, but the /tables/id/ node and all of its children (state, conf, name, etc.) still exist.
> Looking at some thread dumps, I have the default (4) repo runner threads. Three of them are blocked on bulk imports:
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on condition [0x00007f25168e7000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000705a05eb8> (a java.util.concurrent.FutureTask)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
> at java.util.concurrent.FutureTask.get(FutureTask.java:187)
> at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
> at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
> at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
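[Editor's note: the trace above is parked in `FutureTask.get()`, which waits unboundedly for the bulk-load task to finish. As an illustration only (this is not what LoadFiles does), a bounded `get(timeout, unit)` would surface a hang like this as a `TimeoutException` rather than a silently parked repo runner. The class and task below are invented for the sketch.]

```java
import java.util.concurrent.*;

// Illustrative sketch: bounded waiting on a Future, so a hung task
// produces a visible timeout instead of an indefinitely parked thread.
public class BoundedGet {
    public static String getOrTimeout(Future<String> f, long seconds) {
        try {
            return f.get(seconds, TimeUnit.SECONDS); // bounded, unlike plain get()
        } catch (TimeoutException e) {
            return "TIMED_OUT";
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a bulk-import task that never completes.
        Future<String> stuck = pool.submit(() -> {
            Thread.sleep(60_000);
            return "done";
        });
        System.out.println(getOrTimeout(stuck, 1)); // prints TIMED_OUT after ~1s
        pool.shutdownNow();
    }
}
```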
> The fourth repo runner is stuck trying to reserve a new txn (not sure why it's blocked like this, though):
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in Object.wait() [0x00007f25169e8000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:503)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
> - locked <0x00000007014d9928> (a org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
> at com.sun.proxy.$Proxy11.getData(Unknown Source)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
> There were no obvious errors on the monitor, and the master is still in this state.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)