Posted to user@hbase.apache.org by "Murali Krishna. P" <mu...@yahoo.com> on 2011/02/12 16:02:40 UTC

HBase stops responding, after restart got 'oldlogfile.log' missing error and did not start.

Hi all,
   I have a 4-node HBase cluster (0.20.6); host1 runs the master and the other
3 hosts run region servers. The following series of events happened.

1. host4 got disconnected and the total number of region servers became 2
(znode expired). What could be the reason?
2. Some hlog splits and region reassignment happened (2 region servers, 0 dead;
why isn't it 1 dead?)
3. One more znode expiry event occurred, and it became '1 region servers, 1 dead'.
4. java.io.IOException: DFSClient_-889445375 could not complete file
/user/adamaplo/hbase/.META./1028785192/oldlogfile.log.  Giving up.
5. It was stuck after this for a long time (more than an hour, logging a few
things repeatedly; see the logs).
6. I restarted the master and region server at this point.
7. It was unable to get some log file and refused to start up:
'java.io.IOException: Could not obtain block: blk_-4927328817223373854_1605408
file=/user/adamaplo/hbase/Table/1675479948/oldlogfile.log'
8. All the region servers were also logging similar errors.
9. When I tried to get the file from DFS, it was not able to locate the block.
hadoop fsck showed the block as available on one of the datanodes, but I
couldn't get it.
10. After some time, the file got removed from DFS (who does this? compaction
or some other activity?)
11. After step 10, HBase was back to normal.

This is a critical problem for us since the service was unavailable for more 
than 2 hours. I have attached the master logs. Please help me understand each
of the above problems and find a possible fix.


Thanks for the support.

Murali Krishna

Re: HBase stops responding, after restart got 'oldlogfile.log' missing error and did not start.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
There are scores of issues that were fixed in 0.89 and 0.90 around
that part of the code, so it's really hard to tell if you're hitting
any of those. My recommendation is to upgrade... none of the big
installations that I know of are on 0.20.6.

J-D
