
Posted to user@hbase.apache.org by Daniel Iancu <da...@1and1.ro> on 2011/05/23 18:27:07 UTC

live regionservers reported dead

Hello everybody,
I've run into a strange problem. We run a cluster of 6 regionservers and
suddenly the client application started reporting "region not online"
errors. In the web console all regionservers appeared up. I ran hbck and
got strange results:

Number of Tables: 2
Number of live region servers: 6
Number of dead region servers: 12

The cluster was in an inconsistent state. With status 'detailed' in the
hbase shell I got the list of dead machines:

12 dead servers
     search-hadoop-eu006.v300.gmx.net,60020,1305025929461
     search-hadoop-eu002.v300.gmx.net,60020,1305019508570
     search-hadoop-eu004.v300.gmx.net,60020,1305019551236
     search-hadoop-eu003.v300.gmx.net,60020,1305025688666
     search-hadoop-eu005.v300.gmx.net,60020,1305025841017
     search-hadoop-eu006.v300.gmx.net,60020,1306156842070
     search-hadoop-eu005.v300.gmx.net,60020,1305019568146
     search-hadoop-eu001.v300.gmx.net,60020,1305025543786
     search-hadoop-eu004.v300.gmx.net,60020,1305025761173
     search-hadoop-eu002.v300.gmx.net,60020,1305025611163
     search-hadoop-eu006.v300.gmx.net,60020,1305019572576
     search-hadoop-eu003.v300.gmx.net,60020,1305019547053


It appears that all live regionservers are also listed as dead. I tried
hbck -fix and the cluster is now in OK state, but it still reports the
12 dead machines above.
I've checked the logs but found nothing obvious. Any ideas? We use CDH3u0.


Thanks
Daniel

Re: live regionservers reported dead

Posted by Stack <st...@duboce.net>.

On Mon, May 23, 2011 at 9:27 AM, Daniel Iancu <da...@1and1.ro> wrote:
>  Hello everybody
> I've run into this strange problem. We run a 6 RS cluster and suddenly the
> client application started reporting errors, region not online. In the web
> console all regionserver appeared up.

What happened at that time? Check the master log around that timestamp;
it should give you a clue.


> I've run hbck and got strange results

...

> 12 dead servers
>    search-hadoop-eu006.v300.gmx.net,60020,1305025929461
>    search-hadoop-eu002.v300.gmx.net,60020,1305019508570
>    search-hadoop-eu004.v300.gmx.net,60020,1305019551236
>    search-hadoop-eu003.v300.gmx.net,60020,1305025688666
>    search-hadoop-eu005.v300.gmx.net,60020,1305025841017
>    search-hadoop-eu006.v300.gmx.net,60020,1306156842070
>    search-hadoop-eu005.v300.gmx.net,60020,1305019568146
>    search-hadoop-eu001.v300.gmx.net,60020,1305025543786
>    search-hadoop-eu004.v300.gmx.net,60020,1305025761173
>    search-hadoop-eu002.v300.gmx.net,60020,1305025611163
>    search-hadoop-eu006.v300.gmx.net,60020,1305019572576
>    search-hadoop-eu003.v300.gmx.net,60020,1305019547053
>
>

We used to hang on to the list of dead servers.  In 0.90.2 we fixed
this ("HBASE-3580  Remove RS from DeadServer when new instance checks
in").  I'm not sure this change made it into the released cdh3 (You
might check the cdh CHANGES).

So, do the online regionservers have the same startcode (the last
number in each entry listed above)? I'd guess not.
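
The startcode comparison can be sketched in a few lines of Python. This
is only an illustration, not part of HBase: it assumes each entry has the
form host,port,startcode and that the startcode is the regionserver's
start time in milliseconds since the epoch. The function names are made
up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Dead-server entries copied from the "status 'detailed'" output in this thread.
DEAD_SERVERS = """\
search-hadoop-eu006.v300.gmx.net,60020,1305025929461
search-hadoop-eu002.v300.gmx.net,60020,1305019508570
search-hadoop-eu004.v300.gmx.net,60020,1305019551236
search-hadoop-eu003.v300.gmx.net,60020,1305025688666
search-hadoop-eu005.v300.gmx.net,60020,1305025841017
search-hadoop-eu006.v300.gmx.net,60020,1306156842070
search-hadoop-eu005.v300.gmx.net,60020,1305019568146
search-hadoop-eu001.v300.gmx.net,60020,1305025543786
search-hadoop-eu004.v300.gmx.net,60020,1305025761173
search-hadoop-eu002.v300.gmx.net,60020,1305025611163
search-hadoop-eu006.v300.gmx.net,60020,1305019572576
search-hadoop-eu003.v300.gmx.net,60020,1305019547053
"""


def parse_server_name(entry):
    """Split 'host,port,startcode' into its three parts."""
    host, port, startcode = entry.strip().split(",")
    return host, int(port), int(startcode)


def startcodes_by_host(entries):
    """Map each hostname to the sorted list of startcodes seen for it."""
    grouped = defaultdict(list)
    for entry in entries:
        if entry.strip():
            host, _port, startcode = parse_server_name(entry)
            grouped[host].append(startcode)
    return {host: sorted(codes) for host, codes in grouped.items()}


def started_at(startcode):
    """Assumption: the startcode is a start time in epoch milliseconds."""
    return datetime.fromtimestamp(startcode / 1000, tz=timezone.utc)


if __name__ == "__main__":
    grouped = startcodes_by_host(DEAD_SERVERS.splitlines())
    for host, codes in sorted(grouped.items()):
        print(host, [started_at(code).isoformat() for code in codes])
```

The twelve entries collapse to six hostnames, so each entry is a past
instance of one of the six machines. A live server whose current
startcode differs from every entry listed for its host is a new
instance, not a dead one.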

St.Ack

Re: hbase hbck error

Posted by Stack <st...@duboce.net>.

On Wed, May 25, 2011 at 10:49 AM, Jinsong Hu <ji...@hotmail.com> wrote:
> if we add the root region back in, then  essentially the hbck is complaining
> every region is bad,
> which is not true.
>

I did notice, and recently fixed, an issue where HBCK prints an ERROR
for every region that follows a bad one. So rather than a single ERROR
message for the bad region, you get an ERROR for the bad one and for
all the good (and bad) regions that follow it.


> When you say I print more info, does that mean I need to modify the hbck
> code ? I might do it later
> when I can find some time.
>

Yes.  That is what I was suggesting.  hbck is a client-only
application, so you can make changes and try things without having to
change your cluster software.


Thanks for digging in.
St.Ack

Re: hbase hbck error

Posted by Jinsong Hu <ji...@hotmail.com>.

Hi, Stack:
  You have a point. I checked the hbck result from the non-hbase machine,
and it shows:
Summary:
2418 inconsistencies detected.
Status: INCONSISTENT
   That number seemed very familiar to me, so I went to the master admin
page, and found:
Total: 	servers: 6	 	requests=2783, regions=2417

If we add the root region back in (2417 + 1 = 2418), then hbck is
essentially complaining that every region is bad, which is not true.

  On the other hand, hbck on the hbase machine says
0 inconsistencies detected.
Status: OK
  That is probably too good to be true as well.

I ran "hadoop dfs -ls /hbase/table_name | grep region_id" and confirmed
that the region's directory showed up on both machines. On both machines
I was running under the hdfs account.
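
As a rough aid for this kind of check, the directory to look for can be
derived from the region name in the hbck error reported earlier in this
thread. The sketch below is a guess at the layout, not hbck's own code:
it assumes region names look like table,startkey,regionid.encodedname.
and that the region directory is <rootdir>/<table>/<encodedname>, with
/hbase as the root directory as in the command above. The function name
is made up for this example.

```python
def expected_region_dir(region_name, root_dir="/hbase"):
    """Build the HDFS directory expected for a region name.

    Assumes 'table,startkey,regionid.encodedname.' naming and a
    <root_dir>/<table>/<encodedname> directory layout.
    """
    table = region_name.split(",", 1)[0]
    # The last comma-separated field is 'regionid.encodedname.'
    tail = region_name.rsplit(",", 1)[1].rstrip(".")
    _region_id, encoded_name = tail.split(".", 1)
    return f"{root_dir}/{table}/{encoded_name}"


# Region name from the hbck ERROR line, kept as a raw string so the
# literal "\x09" sequences are not interpreted by Python.
REGION_NAME = (
    r"HEARTBEAT_MASTERPATCH,time\x09daily\x092010-08-15\x09uobkayhian_production"
    r"\x09patch-0000694,1287356584131.02f9ec575b19864ae44e714d9245138f."
)

if __name__ == "__main__":
    # prints /hbase/HEARTBEAT_MASTERPATCH/02f9ec575b19864ae44e714d9245138f
    print(expected_region_dir(REGION_NAME))
```

Listing that path with "hadoop dfs -ls" from each machine shows whether
both are looking at the same filesystem.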

When you say I should print more info, does that mean I need to modify
the hbck code? I might do it later when I can find some time.

Jimmy.

Re: hbase hbck error

Posted by Stack <st...@duboce.net>.

On Wed, May 25, 2011 at 9:18 AM, Jinsong Hu <ji...@hotmail.com> wrote:
> I tried several other non-hbase machines that has proper configuration, sure
> enough, all of them complain problems.
>

This is interesting, Jinsong.  For sure the configuration was pointed
at the right filesystem.  Do you think there could have been a
suppressed error or some such thing when remotely querying the
filesystem for the presence of region directories?  Can you add in a
bit of printf'ing to see what's going on in hbck?
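
One way to double-check which filesystem each machine is pointed at is
to compare a few properties from the configuration files on both
machines. The sketch below is illustrative only: it assumes the usual
Hadoop-style <configuration><property> XML, the property list is just a
guess at what is worth comparing (fs.default.name normally lives in
core-site.xml rather than hbase-site.xml), and the hostnames and values
in the sample snippets are invented.

```python
import xml.etree.ElementTree as ET

# Properties worth comparing; this selection is an assumption.
KEYS = ("hbase.rootdir", "hbase.zookeeper.quorum", "fs.default.name")


def read_properties(xml_text):
    """Parse Hadoop-style <configuration><property> XML into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def config_differences(props_a, props_b, keys=KEYS):
    """Return {key: (value_a, value_b)} for keys whose values differ."""
    return {
        key: (props_a.get(key), props_b.get(key))
        for key in keys
        if props_a.get(key) != props_b.get(key)
    }


# Invented sample contents standing in for the files on two machines.
SITE_ON_HBASE_NODE = """<configuration>
  <property><name>hbase.rootdir</name><value>hdfs://nn.example.com:8020/hbase</value></property>
</configuration>"""

SITE_ON_CLIENT = """<configuration>
  <property><name>hbase.rootdir</name><value>file:///tmp/hbase</value></property>
</configuration>"""

if __name__ == "__main__":
    diff = config_differences(
        read_properties(SITE_ON_HBASE_NODE), read_properties(SITE_ON_CLIENT)
    )
    for key, (on_node, on_client) in diff.items():
        print(f"{key}: hbase node={on_node!r} client={on_client!r}")
```

In practice the XML text would be read from the configuration files on
each machine; an empty result for the chosen keys suggests the
difference in hbck output lies elsewhere.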

Thanks for digging in on this.
St.Ack

Re: hbase hbck error

Posted by Jinsong Hu <ji...@hotmail.com>.

This is a follow-up on what I have found. I exported several of the
tables hbck complained about to HDFS, truncated the original tables,
imported them again, and ran hbck. hbck still complains that the HDFS
directory is not there. I looked in HDFS, and the region's directory is
there, so hbck's complaint is bogus this time.

By accident, I ran the same hbck on one of the regionservers, and to my
surprise the check came out clean for all tables! I then ran the command
on several other regionservers and on all 3 hbase masters, and all of
them came out clean, even for the table that had the problem before and
that I didn't export and import.

I tried several other non-hbase machines that have the proper
configuration, and sure enough, all of them report problems.

So it seems the result of hbck depends on whether it runs on a non-hbase
machine or an hbase machine. Judging from the results they show, neither
is correct. The correct result should be that the imported tables are
clean and the non-imported tables are not.

Can anybody explain why hbck behaves this way?

Jimmy

Re: hbase hbck error

Posted by Jinsong Hu <ji...@hotmail.com>.

I checked the master. Unfortunately, I must have a wrong setting,
because the master logs are not there.
So I checked the regionserver that hosted this region. I have 14 days
of logs there and I grepped for 02f9ec575b19864ae44e714d9245138f, but
found nothing. Then I searched all the regionservers' logs for the last
several days, and didn't find anything related to this region either.


Jimmy.

Re: hbase hbck error

Posted by Jean-Daniel Cryans <jd...@apache.org>.

I don't remember seeing this sort of issue a lot, or at all... Usually
the region would not be on .META. so it looks like a different issue.

Could you grep the master logs and see what's the story of that
region? Just look for 02f9ec575b19864ae44e714d9245138f and try to
figure what happened to that region, might give us a clue.
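
A plain grep does the job; for searching many files at once and keeping
file names and line numbers, the same search can be sketched in Python.
The log location below is a placeholder, not a known path on this
cluster, and the function name is made up for this example.

```python
import glob

ENCODED_REGION_NAME = "02f9ec575b19864ae44e714d9245138f"


def lines_mentioning(paths, needle):
    """Yield (path, line_number, line) for every log line containing needle."""
    for path in paths:
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the master writes its logs.
    log_files = sorted(glob.glob("/var/log/hbase/*master*.log*"))
    for path, number, line in lines_mentioning(log_files, ENCODED_REGION_NAME):
        print(f"{path}:{number}: {line}")
```

Compressed rotated logs would need to be decompressed first; this
sketch only reads plain-text files.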

J-D

hbase hbck error

Posted by Jinsong Hu <ji...@hotmail.com>.

Hi,

Today I ran "hbase hbck" to check our production cluster and dev cluster.
The production cluster comes out clean, but
in our dev cluster I have seen more than 2K errors like this:

ERROR: Region HEARTBEAT_MASTERPATCH,time\x09daily\x092010-08-15\x09uobkayhian_production\x09patch-0000694,1287356584131.02f9ec575b19864ae44e714d9245138f. found in META, but not in HDFS, and deployed on m0002040.ppops.net:60020

I checked the hbase GUI, and indeed it is correct: the region is loaded
by the region server, but the HDFS directory is not there.

I am running cdh3u0, and I wonder how this can happen. Once it has
happened, what can I do to bring the table back to a healthy state?

Jimmy.

Re: live regionservers reported dead

Posted by Jean-Daniel Cryans <jd...@apache.org>.

It was fixed in 0.90.3; before that we didn't clear the list.

J-D
