Posted to issues@hbase.apache.org by "Samir Ahmic (JIRA)" <ji...@apache.org> on 2015/10/30 13:32:27 UTC

[jira] [Updated] (HBASE-14729) SplitLogManager does not clean files from WALs folder in case of master failover

     [ https://issues.apache.org/jira/browse/HBASE-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samir Ahmic updated HBASE-14729:
--------------------------------
    Description: 
While I was testing the master failover process on the master branch (distributed cluster setup), I noticed the following:
1. The list of dead regionservers was increasing every time the active master was restarted.
2. The number of folders under /hbase/WALs was increasing every time the active master was restarted (a quick way to spot these leftovers is sketched below).
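Here is a minimal sketch of such a check (a hypothetical standalone class, not HBase code; it only assumes a Hadoop client Configuration whose fs.defaultFS points at the cluster). Each failed cleanup leaves another *-splitting directory behind:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListSplittingDirs {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS must point at the cluster, e.g. hdfs://P3cluster
    FileSystem fs = FileSystem.get(new Configuration());
    // every *-splitting directory left under /hbase/WALs is a leftover
    // from a log-splitting run whose cleanup never finished
    for (FileStatus status : fs.listStatus(new Path("/hbase/WALs"))) {
      if (status.isDirectory() && status.getPath().getName().endsWith("-splitting")) {
        System.out.println("leftover: " + status.getPath());
      }
    }
  }
}
{code}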

Here is the exception from the master logs showing why this is happening:
{code}
2015-10-30 09:41:49,238 INFO  [ProcedureExecutor-3] master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://P3cluster/hbase/WALs/hnode1,16000,1446043659224-splitting] in 21ms
2015-10-30 09:41:49,235 WARN  [ProcedureExecutor-2] master.SplitLogManager: Returning success without actually splitting and deleting all the log files in path hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting: [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting/hnode1%2C16000%2C1446046595488.meta.1446046691314.meta; isDirectory=false; length=39944; replication=3; blocksize=268435456; modification_time=1446050348104; access_time=1446046691317; owner=hbase; group=supergroup; permission=rw-r--r--; isSymlink=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException): `/hbase/WALs/hnode1,16000,1446046595488-splitting is non empty': Directory is not empty
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3524)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3479)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3463)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:751)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:562)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at org.apache.hadoop.ipc.Client.call(Client.java:1411)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy15.delete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:490)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy16.delete(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
	at com.sun.proxy.$Proxy17.delete(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
	at com.sun.proxy.$Proxy17.delete(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1726)
	at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:588)
	at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:584)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:584)
	at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:297)
	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:400)
	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:373)
	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:295)
	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:388)
	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:228)
	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:72)
	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:119)
	at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:452)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1050)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:841)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:794)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:75)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.run(ProcedureExecutor.java:479)
{code}

I have tracked the exception to this line in SplitLogManager#splitLogDistributed:
{code}
297        if (fs.exists(logDir) && !fs.delete(logDir, false))
{code}

Since we are removing a folder, we need to delete it recursively, so this line should be:
{code}
297        if (fs.exists(logDir) && !fs.delete(logDir, true))
{code}
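
To illustrate the difference, here is a minimal standalone sketch (hypothetical class name and paths, not HBase code) of FileSystem#delete semantics: with recursive=false, HDFS refuses to remove a non-empty directory, which is exactly what the exception above shows, while recursive=true removes the directory together with its contents:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveDeleteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // simulate a -splitting directory that still holds one WAL file
    Path logDir = new Path("/tmp/demo-splitting");
    fs.mkdirs(logDir);
    fs.create(new Path(logDir, "leftover.wal")).close();

    try {
      // recursive=false: HDFS refuses to delete a non-empty directory
      // (PathIsNotEmptyDirectoryException, as in the trace above)
      fs.delete(logDir, false);
    } catch (java.io.IOException e) {
      System.out.println("non-recursive delete failed: " + e.getMessage());
    }

    // recursive=true: the directory and everything in it are removed
    System.out.println("recursive delete returned: " + fs.delete(logDir, true));
  }
}
{code}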

This solved the issue. I will attach a patch after some additional testing.


> SplitLogManager does not clean files from WALs folder in case of master failover
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-14729
>                 URL: https://issues.apache.org/jira/browse/HBASE-14729
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.0.0
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)