Posted to issues@hbase.apache.org by "Samir Ahmic (JIRA)" <ji...@apache.org> on 2015/11/03 15:02:28 UTC

[jira] [Commented] (HBASE-14223) Meta WALs are not cleared if meta region was closed and RS aborts

    [ https://issues.apache.org/jira/browse/HBASE-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987312#comment-14987312 ] 

Samir Ahmic commented on HBASE-14223:
-------------------------------------

As per [~enis]'s suggestion on [HBASE-14729|https://issues.apache.org/jira/browse/HBASE-14729] I picked up his patch for testing, and with a few additions I was able to build hbase on the master branch. After applying the patch the issue was still present (I was testing on a distributed cluster: hadoop-2.5.2, zookeeper-3.4.6, hbase-2.0.0-SNAPSHOT), but I found some interesting things in the logs which may shed some more light on this issue.
I was tracking a single case where a meta WAL file is not cleared; here is the sequence:
HDFS layout:
{code}
[hadoop@hnode2 ~]$ hadoop fs -ls -R /hbase/WALs
drwxr-xr-x   - hbase supergroup          0 2015-11-03 10:14 /hbase/WALs/hnode2,16000,1446541683349-splitting
-rw-r--r--   3 hbase supergroup         83 2015-11-03 10:13 /hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.meta.1446541987763.meta
{code}

A few times in the logs I noticed that we report 0 files in the {{-splitting}} folder even though the files were present on HDFS, so I added some additional logging in DefaultWALProvider#isMetaFile(); this is the method we use to filter files from the WALs folder:
{code}
public static boolean isMetaFile(String p) {
    if (p != null && p.endsWith(META_WAL_PROVIDER_ID)) {
      LOG.info("***Returning TRUE for: " + p);
      return true;
    }
    LOG.info("***Returning FALSE for: "+p);
    return false;
  }
{code} 
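For reference, the listing path does not call isMetaFile() directly but goes through a PathFilter. Below is a minimal sketch of such a filter, assuming it just delegates the decision to DefaultWALProvider.isMetaFile() on the file name; the class name and wiring are my simplification, not the exact HBase code:
{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.hbase.wal.DefaultWALProvider;

/** Simplified stand-in for the filter handed to the WAL directory listing. */
public class MetaWALFilter implements PathFilter {
  private final boolean acceptMeta;

  public MetaWALFilter(boolean acceptMeta) {
    this.acceptMeta = acceptMeta;
  }

  @Override
  public boolean accept(Path p) {
    // Same endsWith(META_WAL_PROVIDER_ID) check that the debug logging above
    // traces, applied to the file name component of the path.
    return DefaultWALProvider.isMetaFile(p.getName()) == acceptMeta;
  }
}
{code}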
And here are the logs from the active master that processed the WAL files:
{code}
2015-11-03 10:14:32,059 INFO  [hnode2:16000.activeMasterManager] master.MasterFileSystem: Log folder hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349 doesn't belong to a known region server, splitting
2015-11-03 10:14:33,686 DEBUG [hnode2:16000.activeMasterManager] procedure2.ProcedureExecutor: Procedure ServerCrashProcedure serverName=hnode2,16000,1446541683349, shouldSplitWal=true, carryingMeta=false id=852 state=RUNNABLE:SERVER_CRASH_START added to the store.
2015-11-03 10:14:33,921 DEBUG [ProcedureExecutor-1] procedure.ServerCrashProcedure: Splitting logs from hnode2,16000,1446541683349; region count=0
2015-11-03 10:14:33,944 DEBUG [ProcedureExecutor-1] master.MasterFileSystem: Renamed region directory: hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting
2015-11-03 10:14:33,944 INFO  [ProcedureExecutor-1] master.SplitLogManager: dead splitlog workers [hnode2,16000,1446541683349]
2015-11-03 10:14:33,968 INFO  [ProcedureExecutor-1] wal.DefaultWALProvider: ***Returning FALSE for: hnode2%2C16000%2C1446541683349.1446541986629
2015-11-03 10:14:33,969 INFO  [ProcedureExecutor-1] wal.DefaultWALProvider: ***Returning TRUE for: hnode2%2C16000%2C1446541683349.meta.1446541987763.meta
2015-11-03 10:14:33,969 INFO  [ProcedureExecutor-1] master.SplitLogManager: Started splitting 1 logs in [hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting] for [hnode2,16000,1446541683349]
2015-11-03 10:14:34,065 INFO  [SplitLogWorker-hnode2:16000] wal.DefaultWALProvider: ***Returning FALSE for: WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,080 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: put up splitlog task at znode /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,123 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629 acquired by hnode5,16020,1446455348779
2015-11-03 10:14:34,135 INFO  [main-EventThread] coordination.ZkSplitLogWorkerCoordination: task /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629 preempted from hnode2,16000,1446542058824, current task state and owner=OWNED hnode5,16020,1446455348779
2015-11-03 10:14:34,963 INFO  [hnode2,16000,1446542058824_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629=last_update = 1446542074134 last_version = 2 cur_worker_name = hnode5,16020,1446455348779 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-11-03 10:14:38,183 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629 entered state: DONE hnode5,16020,1446455348779
2015-11-03 10:14:38,199 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.1446541986629 to hdfs://P3cluster/hbase/oldWALs/hnode2%2C16000%2C1446541683349.1446541986629
2015-11-03 10:14:38,202 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629

2015-11-03 10:14:38,209 WARN  [ProcedureExecutor-1] master.SplitLogManager: Returning success without actually splitting and deleting all the log files in path hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting: [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.meta.1446541987763.meta; isDirectory=false; length=83; replication=3; blocksize=268435456; modification_time=1446541987767; access_time=1446541987767; owner=hbase; group=supergroup; permission=rw-r--r--; isSymlink=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException): `/hbase/WALs/hnode2,16000,1446541683349-splitting is non empty': Directory is not empty

2015-11-03 10:14:38,210 INFO  [ProcedureExecutor-1] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting] in 4241ms

2015-11-03 10:14:38,422 INFO  [ProcedureExecutor-0] procedure.ServerCrashProcedure: Finished processing of crashed hnode2,16000,1446541683349
2015-11-03 10:14:38,526 DEBUG [ProcedureExecutor-0] procedure2.ProcedureExecutor: Procedure completed in 4.8450sec: ServerCrashProcedure serverName=hnode2,16000,1446541683349, shouldSplitWal=true, carryingMeta=false id=852 state=FINISHED
{code}
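The "Returning success without actually splitting" WARN lines up with the exception right below it: the meta WAL was never turned into a split task, so the non-recursive delete of the {{-splitting}} directory is refused by the NameNode. A minimal standalone illustration of that failure mode (the class name and path are placeholders and this assumes an HDFS default filesystem; it is not the SplitLogManager code itself):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonEmptyDeleteDemo {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // Placeholder for the leftover -splitting directory that still holds the meta WAL.
    Path splitting = new Path("/hbase/WALs/hnode2,16000,1446541683349-splitting");
    try {
      // recursive=false: HDFS refuses to remove a non-empty directory.
      fs.delete(splitting, false);
    } catch (IOException e) {
      // On HDFS this surfaces as a RemoteException wrapping
      // org.apache.hadoop.fs.PathIsNotEmptyDirectoryException, as in the WARN above.
      System.out.println("Delete refused: " + e);
    }
  }
}
{code}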
What looks wrong here are these lines:
{code}
2015-11-03 10:14:34,065 INFO  [SplitLogWorker-hnode2:16000] wal.DefaultWALProvider: ***Returning FALSE for: WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,080 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination: put up splitlog task at znode /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,123 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629 acquired by hnode5,16020,1446455348779
{code}

We are clearly sending a non-existing WAL file to be processed. I'm not sure why the file name was deformed (I suspect the PathFilter). And the RS which acquires this task throws:
{code}
2015-11-03 10:14:38,118 INFO  [RS_LOG_REPLAY_OPS-hnode5:16020-0] util.FSHDFSUtils: Recovered lease, attempt=1 on file=hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.1446541986629 after 4005ms
2015-11-03 10:14:38,120 WARN  [RS_LOG_REPLAY_OPS-hnode5:16020-0] shortcircuit.DomainSocketFactory: error creating DomainSocket
java.net.ConnectException: connect(2) error: No such file or directory when trying to connect to 'none'
{code} 
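One thing worth noting about the "deformed" name: it matches the URL-encoded relative WAL path used as the split task znode name (the %2C already present in the file name becomes %252C after a second round of encoding). A quick standalone check, assuming plain URL encoding is all that is applied; the class name is made up and this is an illustration, not the actual znode-naming code:
{code}
import java.net.URLDecoder;
import java.net.URLEncoder;

public class TaskNameCheck {
  public static void main(String[] args) throws Exception {
    String relativeWalPath = "WALs/hnode2,16000,1446541683349-splitting/"
        + "hnode2%2C16000%2C1446541683349.1446541986629";
    // '/' -> %2F, ',' -> %2C, '%' -> %25: this reproduces the name seen in the
    // SplitLogWorker log line above.
    String encoded = URLEncoder.encode(relativeWalPath, "UTF-8");
    System.out.println(encoded);
    // Decoding recovers the real file name under the -splitting directory.
    System.out.println(URLDecoder.decode(encoded, "UTF-8"));
  }
}
{code}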

So I removed the filter parameter from SplitLogManager#getFileList():
{code}
-      FileStatus[] logfiles = FSUtils.listStatus(fs, logDir, filter);
+      FileStatus[] logfiles = FSUtils.listStatus(fs, logDir);
{code}
and as a result the issue was gone in my case. I will attach the master logs for both cases, with and without the filter.
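For completeness, the behavioural difference is only in how the {{-splitting}} directory is listed. A minimal sketch of the two variants, assuming FSUtils.listStatus() behaves like FileSystem.listStatus() with an optional PathFilter (simplified and with made-up names, not the real SplitLogManager#getFileList()):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class ListSplittingDir {
  // With a filter: any WAL the filter rejects is silently skipped, so it is
  // never turned into a split task and never archived, and the directory
  // can never be deleted as empty.
  static FileStatus[] listFiltered(FileSystem fs, Path logDir, PathFilter filter)
      throws IOException {
    return fs.listStatus(logDir, filter);
  }

  // Without the filter (the change tested above): every file in the -splitting
  // directory, including a leftover .meta WAL, is handed to the splitter.
  static FileStatus[] listAll(FileSystem fs, Path logDir) throws IOException {
    return fs.listStatus(logDir);
  }
}
{code}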
 



> Meta WALs are not cleared if meta region was closed and RS aborts
> -----------------------------------------------------------------
>
>                 Key: HBASE-14223
>                 URL: https://issues.apache.org/jira/browse/HBASE-14223
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>             Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.4
>
>         Attachments: hbase-14223_v0.patch
>
>
> When an RS opens meta, and later closes it, the WAL(FSHlog) is not closed. The last WAL file just sits there in the RS WAL directory. If RS stops gracefully, the WAL file for meta is deleted. Otherwise if RS aborts, WAL for meta is not cleaned. It is also not split (which is correct) since master determines that the RS no longer hosts meta at the time of RS abort. 
> From a cluster after running ITBLL with CM, I see a lot of {{-splitting}} directories left uncleaned: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls /apps/hbase/data/WALs
> Found 31 items
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 01:14 /apps/hbase/data/WALs/hregion-58203265
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 07:54 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433489308745-splitting
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 09:28 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433494382959-splitting
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 10:01 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433498252205-splitting
> ...
> {code}
> The directories contain WALs from meta: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> Found 2 items
> -rw-r--r--   3 hbase hadoop     201608 2015-06-05 03:15 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> -rw-r--r--   3 hbase hadoop      44420 2015-06-05 04:36 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> The RS hosted the meta region for some time: 
> {code}
> 2015-06-05 03:14:28,692 INFO  [PostOpenDeployTasks:1588230740] zookeeper.MetaTableLocator: Setting hbase:meta region location in ZooKeeper as os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285
> ...
> 2015-06-05 03:15:17,302 INFO  [RS_CLOSE_META-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Closed hbase:meta,,1.1588230740
> {code}
> In between, a WAL is created: 
> {code}
> 2015-06-05 03:15:11,707 INFO  [RS_OPEN_META-os-enis-dal-test-jun-4-5:16020-0-MetaLogRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta with entries=385, filesize=196.88 KB; new WAL /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> When CM killed the region server later master did not see these WAL files: 
> {code}
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:46,075 INFO  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: started splitting 2 logs in [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting] for [os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285]
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:47,300 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436 to hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,497 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329 to hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,507 WARN  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,508 INFO  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: finished splitting (more than or equal to) 129135000 bytes in 2 log files in [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting] in 4433ms
> {code}


