You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2015/04/11 03:34:15 UTC

[jira] [Resolved] (HBASE-6144) Master mistakenly splits live server's HLog file

     [ https://issues.apache.org/jira/browse/HBASE-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-6144.
-----------------------------------
      Resolution: Cannot Reproduce
    Release Note:   (was: Underlying hadoop is 0.22)

Reopen if still an issue with current code

> Master mistakenly splits live server's HLog file
> ------------------------------------------------
>
>                 Key: HBASE-6144
>                 URL: https://issues.apache.org/jira/browse/HBASE-6144
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Ted Yu
>
> RS abcdn0590 is live, but Master does not have it on its onlineserver list. So, Master put up the hlog for splitting as shown in the Master log below:
> {code}
> 2012-05-17 21:43:57,692 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711 acquired by abcdn0770.xyz.com,60020,1337315956278. 
> {code}
> After splitting succeeded, Master deleted the file:
> {code}
> 2012-05-17 21:43:58,721 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711
> {code}
> RS abcdn0590 lost the lease to RS abcdn0770, and try to do a Log Roller which closes the current hlog, and create a new one, as shown in the namenode log:
> {code}
> 2012-05-17 21:43:58,422 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=blk_2867982016684075739_12741027, file=/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711, newgenerationstamp=12911920, newlength=134, newtargets=[10.115.13.24:50010, 10.115.15.46:50010, 10.115.15.23:50010]) successful
> 2012-05-17 21:43:59,883 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882. blk_3811725326431482476_12913541{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.115.13.24:50010|RBW], ReplicaUnderConstruction[10.115.17.18:50010|RBW], ReplicaUnderConstruction[10.115.17.15:50010|RBW]]}
> {code}
>  
> When RS 0590 try to close the old hlog 1337315957711, it received fatal error below due to the original hlog is already deleted. The fatal error will cause RS abcdn0590 to shutdown itself later.
> {code}
> 2012-05-17 21:43:58,889 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@abcdn0590.xyz.com,60020,1337315957185 reported a fatal error:
> ABORTING region server abcdn0590.xyz.com,60020,1337315957185: IOE in log roller
> Cause:
> java.io.FileNotFoundException: File does not exist: hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:742)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:583)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
> {code}
>  
> RS abcdn0590 shutdown at around 21:44. But in the /hbase/.logs dir, it left two sub folder for the RS abcdn0590 with the same startcode 1337315957185 , they are
> ·         /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/
> ·         /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/
>  
> Later on, at around 21:46:30, Master retry log splitting, this time,  it still consider RS abcdn0590 as dead RS and try to put up its hlog for others to grab and split. It finds the folder /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/, and the first step it does is to rename it to adding suffix of –splitting.  However, the same folder already exist. The rename function does not handle the case where the destination folder already exist, instead, the behavior is putting the src folder under the dst folder, so the path structure looks like dst/src/file. In our case, It is /hbase/.logs.20120518.1204/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882.
>  
> This is from the master log, we can see that two folders for the same RS 0590 at same startcode exists:
> {code}
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1329941607395-splitting doesn't belong to a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185 doesn't belong to a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting doesn't belong to a known region server, splitting
>  
> 2012-05-17 21:46:30,962 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: Renamed region directory: hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting
> {code}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)