Posted to dev@accumulo.apache.org by "Eric Newton (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 20:16:55 UTC

[jira] [Created] (ACCUMULO-366) master killed a tablet server

master killed a tablet server
-----------------------------

                 Key: ACCUMULO-366
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-366
             Project: Accumulo
          Issue Type: Bug
          Components: master
    Affects Versions: 1.4.0
         Environment: randomwalk test on a 10 node test cluster
            Reporter: Eric Newton
            Assignee: Keith Turner


Master killed a tablet server for having long hold times.

The tablet server had this error during minor compaction:

{noformat}
01 23:57:20,073 [security.ZKAuthenticator] ERROR: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/users/user004
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/users/user004
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1271)
        at org.apache.accumulo.core.zookeeper.ZooUtil.recursiveDelete(ZooUtil.java:103)
        at org.apache.accumulo.core.zookeeper.ZooUtil.recursiveDelete(ZooUtil.java:117)
        at org.apache.accumulo.server.zookeeper.ZooReaderWriter.recursiveDelete(ZooReaderWriter.java:67)
        at sun.reflect.GeneratedMethodAccessor53.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:169)
        at $Proxy4.recursiveDelete(Unknown Source)
        at org.apache.accumulo.server.security.ZKAuthenticator.dropUser(ZKAuthenticator.java:252)
        at org.apache.accumulo.server.security.Auditor.dropUser(Auditor.java:104)
        at org.apache.accumulo.server.client.ClientServiceHandler.dropUser(ClientServiceHandler.java:136)
        at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:58)
        at $Proxy2.dropUser(Unknown Source)
        at org.apache.accumulo.core.client.impl.thrift.ClientService$Processor$dropUser.process(ClientService.java:2257)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2037)
        at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:151)
        at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
        at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:199)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)

{noformat}

This tablet was the result of a split that occurred during a delete.  The master missed this tablet when taking tablets offline.

We need to do a consistency check on the offline tablets before deleting the table information in zookeeper.
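A minimal sketch of what such a check could look like, assuming the caller re-scans the table's tablets after the offline pass so that tablets created by a concurrent split are included. The class, enum, and method names below are hypothetical and are not taken from the Accumulo code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Accumulo: decides whether it is safe to
// delete a table's information in zookeeper.
final class OfflineCheckSketch {
    // Simplified tablet states, assumed for illustration only.
    enum TabletState { HOSTED, ASSIGNED, UNASSIGNED }

    // Returns the tablets (keyed by extent string) that are not yet unassigned.
    static List<String> findTabletsStillOnline(Map<String, TabletState> tabletStates) {
        List<String> online = new ArrayList<>();
        for (Map.Entry<String, TabletState> e : tabletStates.entrySet()) {
            if (e.getValue() != TabletState.UNASSIGNED) {
                online.add(e.getKey());
            }
        }
        return online;
    }

    // The table information should only be deleted when this returns true.
    static boolean safeToDeleteTableInfo(Map<String, TabletState> tabletStates) {
        return findTabletsStillOnline(tabletStates).isEmpty();
    }
}
```

The map passed in has to come from a fresh scan taken after the tablets were asked to go offline; checking the list gathered before the split would miss the same tablet again.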



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (ACCUMULO-366) master killed a tablet server

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-366:
----------------------------------

    Affects Version/s:     (was: 1.4.0)
        Fix Version/s: 1.4.0
    


        

[jira] [Commented] (ACCUMULO-366) master killed a tablet server

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202625#comment-13202625 ] 

Keith Turner commented on ACCUMULO-366:
---------------------------------------

Saw this bug again. A minor compaction was attempted after the tablet was closed. Looking at the code, the initiateMinorCompaction function tries to get the flush id from zookeeper even if the tablet is closed. Calling initiateMinorCompaction on a closed tablet should do nothing.

{noformat}
07 08:10:38,419 [tabletserver.Tablet] TABLET_HIST: f5i;10a579089cf842a0< closed

07 08:15:38,426 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor compaction chosen
07 08:15:38,427 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING f5i;10a579089cf842a0<  total = 32,091,937 ingestMemory = 32,091,937
07 08:15:38,427 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem = 99,252 chosenIT = 300.01 load 125,050
07 08:15:38,427 [tabletserver.TabletServerResourceManager] ERROR: Minor compactions for memory managment failed
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/fbebb086-960c-4a97-b502-154fc333d766/tables/f5i/flush-id
        at org.apache.accumulo.server.tabletserver.Tablet.getFlushID(Tablet.java:2349)
        at org.apache.accumulo.server.tabletserver.Tablet.initiateMinorCompaction(Tablet.java:2287)
        at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:328)
        at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework.access$1(TabletServerResourceManager.java:303)
        at org.apache.accumulo.server.tabletserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:252)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/fbebb086-960c-4a97-b502-154fc333d766/tables/f5i/flush-id
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
        at org.apache.accumulo.core.zookeeper.ZooReader.getData(ZooReader.java:42)
        at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:169)
        at $Proxy4.getData(Unknown Source)
        at org.apache.accumulo.server.tabletserver.Tablet.getFlushID(Tablet.java:2347)
        ... 6 more

{noformat}
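A rough sketch of the guard described above, assuming a simplified tablet with a closed flag. TabletSketch and its flushIdSource are hypothetical stand-ins for the real Tablet class and its zookeeper read; this is not the actual fix.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for a tablet, not the real Accumulo Tablet class.
final class TabletSketch {
    private boolean closed = false;
    private long lastFlushId = -1;
    // Stands in for reading the table's flush-id node from zookeeper.
    private final LongSupplier flushIdSource;

    TabletSketch(LongSupplier flushIdSource) {
        this.flushIdSource = flushIdSource;
    }

    synchronized void close() {
        closed = true;
    }

    // Returns true if a minor compaction was started, false if the call was ignored.
    synchronized boolean initiateMinorCompaction() {
        if (closed) {
            // Closed tablet: do nothing and never read the flush id, since the
            // node may already have been deleted along with the table.
            return false;
        }
        lastFlushId = flushIdSource.getAsLong();
        // A real implementation would queue the compaction here using lastFlushId.
        return true;
    }

    synchronized long getLastFlushId() {
        return lastFlushId;
    }
}
```

Checking the flag before the read means the idle-compaction path in the memory manager gets a quiet no-op instead of a NoNodeException when it picks a tablet that was closed minutes earlier.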
                


        

[jira] [Resolved] (ACCUMULO-366) master killed a tablet server

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-366.
-----------------------------------

    Resolution: Fixed
    
