Posted to user@hbase.apache.org by Michał Podsiadłowski <po...@gmail.com> on 2010/03/05 11:03:14 UTC

Hmaster fails to detect and retry failed region assign attempt.

Hi hbase-users!

Yesterday we did quite an important test. On our production environment we
introduced HBase as a webcache layer (the first step in integrating it into
our env), and in a controlled manner we tried to break it ;). We started to
stop various elements, starting from a datanode, an hregion, etc. Everything
was working very nicely until my coworker started to simulate a disaster -
he shut down 2/3 of our cluster, i.e. 2 of the 3 datanodes/hregions. It was
still fine, though query times were significantly higher - which wasn't a
surprise.
Then one of the hregions was started by the watchdog, and just after it
stood up my friend invoked stop. Regions had already started to be migrated
to this node; one of them was assigned by the hmaster and opened on the
hregion (there is a message in the log), but the confirmation never arrived
at the hmaster. The region location was not saved to meta, and this state
persisted until the hmaster - and with it all regions - was restarted. We
couldn't scan or get any row from that region, nor disable the table. It
looked to us like the master gave up trying to assign the region, or assumed
the region was successfully assigned and opened.
I know the scenario we simulated was not a "normal" use case, but still we
think the cluster should recuperate after some time even from such a
disaster. Just to clarify: all data from this table were replicated, so no
blocks were missing.
Our HBase is 0.20.3 from Cloudera and Hadoop is 0.20.1, also a clean
Cloudera release. (Are any patches advised?)

Our cluster consists of 4 physical machines partitioned with Xen:
3 machines split into datanode + hregion (4 GB RAM) / zookeeper (512 MB RAM)
/ our other app,
plus a 4th machine split into namenode (2 GB) / secondary namenode (2 GB) /
hmaster (1 GB).

Region causing problem is _old-home,,1267642988312.
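For reference, region names in HBase 0.20 have the form
table,startKey,regionId, where the id is a millisecond timestamp taken at
region creation. A small sketch (plain Python, not an HBase API - the
helper name is mine) to decode the name above:

```python
# Hypothetical helper: split an HBase 0.20-style region name
# "table,startKey,regionId" into its parts. Not part of any HBase API.
from datetime import datetime, timezone

def parse_region_name(region_name):
    # Table names cannot contain commas, so split the table off the front;
    # the start key may contain commas, so take the region id off the back.
    table, _, rest = region_name.partition(",")
    start_key, _, region_id = rest.rpartition(",")
    return {
        "table": table,
        "start_key": start_key,       # empty string = first region of the table
        "region_id": int(region_id),  # creation timestamp in epoch millis
        "created": datetime.fromtimestamp(int(region_id) / 1000,
                                          tz=timezone.utc),
    }

info = parse_region_name("_old-home,,1267642988312")
print(info["table"], repr(info["start_key"]), info["created"])
```

The empty start key tells you this is the first (here, only) region of the
_old-home table, created a couple of days before the test.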

Some logs you can find here:
fwdn2 - the region server that was stopped during region assignment -
http://pastebin.com/uL48KCjd

For unknown reasons the log from the master is corrupted - after some point
it appears as @@@@@@@.. in vim. Yesterday it was fine, though.
What I saw there was something like this:

10/03/04 11:24:18 INFO master.RegionManager: Assigning region
_old-home,,1267642988312 to fwdn1,60020,1267698243695
and then nothing more about this region.
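That pattern - an "Assigning region" line with no later confirmation - can
be hunted for mechanically. A sketch that scans master log lines for it;
the assignment pattern matches the excerpt above, but the confirmation
pattern ("region ... open") is a hypothetical placeholder you would adjust
to whatever your master actually logs on a successful open:

```python
# Sketch: find regions that were assigned but never confirmed open,
# according to a master log. The "Assigning region" format matches the
# log excerpt in this thread; OPEN_RE is an assumed placeholder pattern.
import re

ASSIGN_RE = re.compile(r"Assigning region (\S+) to (\S+)")
OPEN_RE = re.compile(r"region (\S+) open")

def unconfirmed_assignments(log_lines):
    assigned, opened = {}, set()
    for line in log_lines:
        m = ASSIGN_RE.search(line)
        if m:
            assigned[m.group(1)] = m.group(2)
            continue
        m = OPEN_RE.search(line)
        if m:
            opened.add(m.group(1))
    # Regions assigned somewhere but with no later open confirmation.
    return {r: server for r, server in assigned.items() if r not in opened}

log = [
    "10/03/04 11:24:18 INFO master.RegionManager: Assigning region "
    "_old-home,,1267642988312 to fwdn1,60020,1267698243695",
]
print(unconfirmed_assignments(log))
```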

Any help appreciated.

Thanks
Michal

Re: Hmaster fails to detect and retry failed region assign attempt.

Posted by Michał Podsiadłowski <po...@gmail.com>.
On 5 March 2010 at 11:12, Jonathan Gray <jl...@streamy.com> wrote:

> Hey Michal,
>
> There was an issue in the past where ROOT would not be properly reassigned
> if there was only a single server left.
>
> https://issues.apache.org/jira/browse/HBASE-1908
>
> But that was fixed back in 0.20.2.
>
> Can you post the master log?
>
No, I can't - dunno what happened, but after some point there are only @
signs when I try to open it with vim or gedit - it got corrupted. And of
course it had to happen right at that point...

While it was still fine there were no suspicious messages. There was a
message about assigning that faulty region to fwdn2 - the server that was
shut down - like this:

10/03/04 11:24:18 INFO master.RegionManager: Assigning region
_old-home,,1267642988312 to fwdn1,60020,1267698243695
and then nothing more about this region.

But there was no confirmation of success on the master.

RE: Hmaster fails to detect and retry failed region assign attempt.

Posted by Jonathan Gray <jl...@streamy.com>.
Hey Michal,

There was an issue in the past where ROOT would not be properly reassigned
if there was only a single server left.

https://issues.apache.org/jira/browse/HBASE-1908

But that was fixed back in 0.20.2.

Can you post the master log?

JG

-----Original Message-----
From: Michał Podsiadłowski [mailto:podsiadlowski@gmail.com] 
Sent: Friday, March 05, 2010 2:03 AM
To: hbase-user@hadoop.apache.org
Subject: Hmaster fails to detect and retry failed region assign attempt.



Re: Hmaster fails to detect and retry failed region assign attempt.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Michal,

Doing tests like that on such a small cluster is basically just asking
for trouble ;)

So you might be hitting http://issues.apache.org/jira/browse/HBASE-2244

Also for clusters under 10 nodes you absolutely need hadoop 0.20.2,
which has https://issues.apache.org/jira/browse/HDFS-872. The short
story is that the Namenode will keep sending wrong information to the
DFSClient the region server is using... also it has
http://issues.apache.org/jira/browse/HDFS-101

Finally, don't forget that hadoop 0.20 doesn't support fs sync; if you
kill -9 the region server holding ROOT or META you might lose some rows if
they are very recent.

WRT your problem: unless we have all the region server logs (you don't have
that many), the master log, and a timeline of when your coworker started
shutting things down, it's going to be hard to debug.

thx,

J-D
