You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2015/05/05 03:25:06 UTC

[jira] [Comment Edited] (ACCUMULO-3772) WAL prematurely deleted for root table

    [ https://issues.apache.org/jira/browse/ACCUMULO-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527726#comment-14527726 ] 

Josh Elser edited comment on ACCUMULO-3772 at 5/5/15 1:24 AM:
--------------------------------------------------------------

Still trying to sense together what's happening. [~ecn] I could use an explanation as to why we have {{RootTable#ZROOT_TABLET_WALOGS}} _and_ {{RootTable#ZROOT_TABLET_CURRENT_LOGS}}. I have no definitive reason as to why we need two. Are WALOGS the ones available for use and CURRENT_LOGS the ones in use by the root table?

We know that we compacted root and removed the log entries

{noformat}
2015-05-04 08:15:35,850 [tablet.Tablet] DEBUG: Logs for memory compacted: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:35,851 [tablet.Tablet] DEBUG: Logs to be destroyed: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:35,869 [tablet.DatafileManager] TABLET_HIST: +r<< MinC [memory] -> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/F0000038.rf
2015-05-04 08:15:36,169 [tablet.DatafileManager] TABLET_HIST: +r<< MajC [hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/A0000037.rf, hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/F0000038.rf] --> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/A0000039.rf
{noformat}

A short while later, we got a shutdown request and cleanly unloaded things (just don't ask me why root was closed before a metadata tablet was -- another [~ecn] question)

{noformat}
2015-05-04 08:19:04,164 [tablet.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) +r<<
2015-05-04 08:19:04,165 [tablet.Tablet] DEBUG: completeClose(saveState=true completeClose=true) +r<<
2015-05-04 08:19:04,210 [tablet.Tablet] TABLET_HIST: +r<< closed
2015-05-04 08:19:04,211 [tserver.TabletServer] DEBUG: Unassigning +r<<@(null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000],null)
2015-05-04 08:19:04,211 [state.ZooStore] DEBUG: Removing /root_tablet/location
2015-05-04 08:19:04,225 [state.ZooStore] DEBUG: Removing /root_tablet/future_location
2015-05-04 08:19:04,227 [state.ZooTabletStateStore] DEBUG: unassign root tablet location
2015-05-04 08:19:04,228 [tserver.TabletServer] INFO : unloaded +r<<
2015-05-04 08:19:04,233 [tserver.TabletServer] DEBUG: MultiScanSess 192.168.74.4:49459 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 
2015-05-04 08:19:04,579 [tablet.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) !0<;~
2015-05-04 08:19:04,605 [tablet.MinorCompactor] DEBUG: Begin minor compaction hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/!0/default_tablet/F000003a.rf_tmp !0<;~
2015-05-04 08:19:04,743 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE compactionThreshold = 0.605 maxObserved = 477
2015-05-04 08:19:04,743 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER compactionThreshold = 0.666
2015-05-04 08:19:04,749 [tablet.Compactor] DEBUG: Compaction !0<;~ 1 read | 1 written |    200 entries/sec |  0.005 secs
2015-05-04 08:19:04,763 [tablet.Tablet] DEBUG: Logs for memory compacted: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,764 [tablet.Tablet] DEBUG: Logs to be destroyed: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,820 [log.TabletServerLogger] DEBUG:  wrote MinC finish  32: writeTime:9ms 
2015-05-04 08:19:04,827 [tablet.DatafileManager] TABLET_HIST: !0<;~ MinC [memory] -> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/!0/default_tablet/F000003a.rf
2015-05-04 08:19:04,827 [tablet.DatafileManager] DEBUG: MinC finish lock 0.00 secs !0<;~
2015-05-04 08:19:04,828 [tablet.Tablet] DEBUG: completeClose(saveState=true completeClose=true) !0<;~
2015-05-04 08:19:04,875 [tablet.Tablet] TABLET_HIST: !0<;~ closed
2015-05-04 08:19:04,875 [tserver.TabletServer] DEBUG: Unassigning !0<;~@(null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000],null)
2015-05-04 08:19:04,902 [tserver.TabletServer] INFO : unloaded !0<;~
2015-05-04 08:19:05,093 [tserver.TabletServer] INFO : Master requested tablet server halt
{noformat}




was (Author: elserj):
Still trying to sense together what's happening. [~ecn] I could use an explanation as to why we have {{RootTable#ZROOT_TABLET_WALOGS}} _and_ {{RootTable#ZROOT_TABLET_CURRENT_LOGS}}. I have no definitive reason as to why we need two. Are WALOGS the ones available for use and CURRENT_LOGS the ones in use by the root table?

We know that we compacted root and removed the log entries

{noformat}
2015-05-04 08:15:35,850 [tablet.Tablet] DEBUG: Logs for memory compacted: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:35,851 [tablet.Tablet] DEBUG: Logs to be destroyed: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:35,869 [tablet.DatafileManager] TABLET_HIST: +r<< MinC [memory] -> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/F0000038.rf
2015-05-04 08:15:36,169 [tablet.DatafileManager] TABLET_HIST: +r<< MajC [hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/A0000037.rf, hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/F0000038.rf] --> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/+r/root_tablet/A0000039.rf
{noformat}

A short while later, we got a shutdown request and cleanly unloaded things (just don't ask me why root was closed before a metadata tablet was -- another [~ecn]

{noformat}
2015-05-04 08:19:04,164 [tablet.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) +r<<
2015-05-04 08:19:04,165 [tablet.Tablet] DEBUG: completeClose(saveState=true completeClose=true) +r<<
2015-05-04 08:19:04,210 [tablet.Tablet] TABLET_HIST: +r<< closed
2015-05-04 08:19:04,211 [tserver.TabletServer] DEBUG: Unassigning +r<<@(null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000],null)
2015-05-04 08:19:04,211 [state.ZooStore] DEBUG: Removing /root_tablet/location
2015-05-04 08:19:04,225 [state.ZooStore] DEBUG: Removing /root_tablet/future_location
2015-05-04 08:19:04,227 [state.ZooTabletStateStore] DEBUG: unassign root tablet location
2015-05-04 08:19:04,228 [tserver.TabletServer] INFO : unloaded +r<<
2015-05-04 08:19:04,233 [tserver.TabletServer] DEBUG: MultiScanSess 192.168.74.4:49459 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 
2015-05-04 08:19:04,579 [tablet.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) !0<;~
2015-05-04 08:19:04,605 [tablet.MinorCompactor] DEBUG: Begin minor compaction hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/!0/default_tablet/F000003a.rf_tmp !0<;~
2015-05-04 08:19:04,743 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE compactionThreshold = 0.605 maxObserved = 477
2015-05-04 08:19:04,743 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER compactionThreshold = 0.666
2015-05-04 08:19:04,749 [tablet.Compactor] DEBUG: Compaction !0<;~ 1 read | 1 written |    200 entries/sec |  0.005 secs
2015-05-04 08:19:04,763 [tablet.Tablet] DEBUG: Logs for memory compacted: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,764 [tablet.Tablet] DEBUG: Logs to be destroyed: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,820 [log.TabletServerLogger] DEBUG:  wrote MinC finish  32: writeTime:9ms 
2015-05-04 08:19:04,827 [tablet.DatafileManager] TABLET_HIST: !0<;~ MinC [memory] -> hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/tables/!0/default_tablet/F000003a.rf
2015-05-04 08:19:04,827 [tablet.DatafileManager] DEBUG: MinC finish lock 0.00 secs !0<;~
2015-05-04 08:19:04,828 [tablet.Tablet] DEBUG: completeClose(saveState=true completeClose=true) !0<;~
2015-05-04 08:19:04,875 [tablet.Tablet] TABLET_HIST: !0<;~ closed
2015-05-04 08:19:04,875 [tserver.TabletServer] DEBUG: Unassigning !0<;~@(null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000],null)
2015-05-04 08:19:04,902 [tserver.TabletServer] INFO : unloaded !0<;~
2015-05-04 08:19:05,093 [tserver.TabletServer] INFO : Master requested tablet server halt
{noformat}

> WAL prematurely deleted for root table
> --------------------------------------
>
>                 Key: ACCUMULO-3772
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3772
>             Project: Accumulo
>          Issue Type: Bug
>          Components: gc, tserver
>         Environment: SHA1 2436eec237dc23608e91cae56777a323e2f77295 (I believe). Running with Kerberos on. HDFS 2.7.1-SNAPSHOT
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 1.7.0
>
>
> I have a 5-node test system which I ran all of the ITs against.
> When I came back to the cluster later, I observed the following:
> * Multiple running MinCs against accumulo.metadata that weren't running
> * {{accumulo.root}} without an online tablet
> * Multiple queued FATE ops (PopulateMetadata blocked).
> Ultimately, found that recovery was repeatedly failing (without any notification on the monitor, mind you). 
> {noformat}
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-
> f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-
> f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-
> f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [master.Master] DEBUG: Root Table location State: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-4
> 0d7-888e-f4b7dde1f966 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/fc0e8426-f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,171 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4
> 542-ab24-3ebdfa5d6fe4 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,174 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4
> f43-bfc6-82f978b8ccb2 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Note *088e3c39-f19e-4f43-bfc6-82f978b8ccb2*. Eventually we see that the file is gone:
> {noformat}
> 2015-05-04 21:45:15,647 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:666)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {noformat}
> The most likely candidate to direct hatred towards is the GC
> {noformat}
> [root@os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2 accumulo]# fgrep 088e3c39-f19e-4f43-bfc6-82f978b8ccb2 gc_os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal.debug.log
> 2015-05-04 08:30:15,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,223 [gc.GarbageCollectWriteAheadLogs] DEBUG: WAL not needed for replication hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,268 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing unused WAL for server os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000] log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Sure enough, the GC deleted the file. I'll be digging from here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)