Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/01/28 20:03:40 UTC
[jira] [Commented] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation
[ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884438#comment-13884438 ]
Josh Elser commented on ACCUMULO-2269:
--------------------------------------
As Keith pointed out to me on IRC, the delete table ops are probably not running because they are being starved by the aforementioned limit of 4 repo runner threads. There are 5 bulk import threads currently active in the master, and all of them appear to be stuck in the same place:
{noformat}
"bulk import 7" daemon prio=10 tid=0x0000000001107800 nid=0x4ec9 runnable [0x00007f25146c4000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x000000070568d930> (a sun.nio.ch.Util$2)
- locked <0x000000070568d918> (a java.util.Collections$UnmodifiableSet)
- locked <0x000000070567cb28> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x0000000705b6b020> (a java.io.BufferedInputStream)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:271)
at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.recv_bulkImportFiles(ClientService.java:263)
at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.bulkImportFiles(ClientService.java:244)
at org.apache.accumulo.server.master.tableOps.LoadFiles$1.call(BulkImport.java:544)
at org.apache.accumulo.server.master.tableOps.LoadFiles$1.call(BulkImport.java:528)
at org.apache.accumulo.trace.instrument.TraceCallable.call(TraceCallable.java:48)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:744)
Locked ownable synchronizers:
- <0x0000000705aae300> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{noformat}
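To make the starvation argument concrete, here is a minimal sketch of it in plain Java. It is not Accumulo's Fate code: the class name RunnerStarvationSketch, the method shortOpCompletes, and the latch standing in for a bulk import RPC that never answers are all hypothetical, and a plain fixed-size thread pool is assumed in place of the repo runners. It only illustrates that once every runner is parked on a call that never returns, a short operation queued behind them (a delete table, say) never gets a thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RunnerStarvationSketch {

    // Returns true if a short operation, submitted behind `blockedOps` operations
    // that never return, completes within waitMillis on a pool of `runners` threads.
    static boolean shortOpCompletes(int runners, int blockedOps, long waitMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(runners);
        // Hypothetical stand-in for a remote call that never answers.
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedOps; i++) {
                pool.submit(() -> {
                    neverReleased.await(); // parks this runner thread indefinitely
                    return null;
                });
            }
            // Stand-in for a quick operation waiting for a free runner.
            Future<String> shortOp = pool.submit(() -> "done");
            try {
                shortOp.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no runner was free to pick it up in time
            }
        } finally {
            pool.shutdownNow(); // interrupts the parked runners so the JVM can exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("3 of 4 runners blocked: " + shortOpCompletes(4, 3, 500));
        System.out.println("4 of 4 runners blocked: " + shortOpCompletes(4, 4, 500));
    }
}
```

With one runner left free the short operation should complete; with all four occupied it should time out, which is the same shape as the DeleteTable txns sitting in IN_PROGRESS behind the bulk imports.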
> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
> Key: ACCUMULO-2269
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Environment: 1.5.1-SNAPSHOT: 8981ba04
> Reporter: Josh Elser
> Priority: Critical
> Fix For: 1.5.1
>
>
> Was running LongClean randomwalk with agitation. Came back to the system with three tables "stuck" in DELETING on the monitor and a generally idle system. Upon investigation, multiple fate txns appear to be deadlocked, in addition to the delete tables.
> {noformat}
> txid: 7ca950aa8de76a17 status: IN_PROGRESS op: DeleteTable locked: [W:2dc] locking: [] top: CleanUp
> txid: 1071086efdbed442 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: LoadFiles
> txid: 32b86cfe06c2ed5d status: IN_PROGRESS op: DeleteTable locked: [W:2d9] locking: [] top: CleanUp
> txid: 358c065b6cb0516b status: IN_PROGRESS op: DeleteTable locked: [W:2dw] locking: [] top: CleanUp
> txid: 26b738ee0b044a96 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> txid: 16edd31b3723dc5b status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> txid: 63c587eb3df6c1b2 status: IN_PROGRESS op: CompactRange locked: [R:2cr] locking: [] top: CompactionDriver
> txid: 722d8e5488531735 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
> {noformat}
> I started digging into the DeleteTable ops. Each txn still appears to be active and holds the table_lock for its respective table in ZK, but the /tables/id/ node and all of its children (state, conf, name, etc.) still exist.
> Looking at some thread dumps, I have the default (4) repo runner threads. 3 of them are blocked on bulk imports:
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on condition [0x00007f25168e7000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000705a05eb8> (a java.util.concurrent.FutureTask)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
> at java.util.concurrent.FutureTask.get(FutureTask.java:187)
> at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
> at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
> at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
> The 4th repo runner is stuck trying to reserve a new txn (not sure why he's locked like this, though):
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in Object.wait() [0x00007f25169e8000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:503)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
> - locked <0x00000007014d9928> (a org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
> at com.sun.proxy.$Proxy11.getData(Unknown Source)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
> There were no obvious errors on the monitor, and the master is still presently in this state.