Posted to user@hbase.apache.org by Michał Podsiadłowski <po...@gmail.com> on 2010/01/29 14:01:09 UTC

Hbase fails at moderate load.

Hi all!

I'm in the middle of some performance and stability testing of our small
HBase cluster to check whether it is suitable for our application.
We want to use it as the web cache persistence layer for our web app, which
handles quite a large amount of traffic.
Of course I have lots of problems with it.

The main one is that the client applications (web servers) cannot persist or retrieve
rows and fail miserably with exceptions like this:

org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region
oldAppWebSingleRowCacheStore,filmMenuCuriosities-not_selected\xC2\xAC150,1264766907002
        at
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)

 could not retrieve persisted cache id 'filmRanking' for key '3872'
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 10.0.100.51:60020 for region
oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498, row
'filmRanking\xC2\xAC3872', but failed after 2 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException:
oldAppWebSingleRowCacheStore,filmRanking�3746,1264766860498
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2266)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1730)

This happens every time the first region starts to split. As far as I can
see, the table is set to enabled *false* (web admin), the web admin becomes a
little less responsive - listing the table's regions shows no regions -
and after a while I can see 500 or more regions. Some of them, as the exceptions
show, are not fully available. HDFS doesn't seem to be the main issue. When
I run fsck it says the hbase dir is healthy apart from some under-replicated
blocks. Occasionally I saw that some blocks were missing, but I think this
was due to "Too many files open" exceptions (too small a region size - now
it's back to the default of 64).
The amount of data is not enormous - around 1 GB in fewer than 100k rows when these
problems start to occur. Requests per second are, I think, small - 20-30 per
second.
What else I can say is that I've set the max HBase retry count to only 2, because we
can't allow clients to wait any longer for a response.
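
For illustration only (this is not our actual code): roughly the kind of client
call that is failing, with the retry property I mentioned set on the configuration:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CacheStoreClientSketch {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        // fail fast: 2 attempts only, so web requests don't hang waiting on HBase
        conf.setInt("hbase.client.retries.number", 2);
        HTable table = new HTable(conf, "oldAppWebSingleRowCacheStore");
        Result result = table.get(new Get(Bytes.toBytes("filmRanking\u00AC3872")));
        System.out.println(result.isEmpty() ? "cache miss" : "cache hit");
    }
}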

What I would like to know is whether the table is always disabled while
region splits are performed. And is it truly disabled then, so that clients
can't do anything?
It looks like the status says disabled, but requests are still processed,
though with different results (some like the above).



My cluster setup is probably useful to know -
3 CentOS virtual machines on Xen, each running a DataNode/HRegionServer and ZooKeeper, with one of
them also acting as the master and secondary master.
2 GB of RAM on each. Currently the Hadoop processes run with Xmx 512 and HBase
with 256, but none of them is swapping or going out of memory.
The GC logs look normal - stop-the-world pauses are not occurring ;)
top says the CPUs are nearly idle on all machines.

It's far from ideal, but we need to prove that this can work reliably to get
more toys.
Maybe next week we will be able to test on some better machines, but for now
that's all I've got.


Any advice is welcome.


Thanks,
Michal

Re: Hbase fails at moderate load.

Posted by Michał Podsiadłowski <po...@gmail.com>.
Hi,
it's me again, having problems - I hope this is not another misconfiguration
problem (or maybe it would be better if it were one).
After loading a moderate amount of data - around 3GB - some rows are not
available, due to strange exceptions:

java.io.IOException: java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831

When trying to scan the table, the region server pukes like this:

2010-02-03 16:03:39,060 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60020, call next(2813423168169765496, 1) from 10.0.100.50:41364:
error: java.io.IOException: java.lang.RuntimeException: java.io.IOException:
Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
java.io.IOException: java.lang.RuntimeException: java.io.IOException: Cannot
open filename /hbase/filmContributors/1670715971/content/3783592739034234831
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:872)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:862)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1918)
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
Caused by: java.lang.RuntimeException: java.io.IOException: Cannot open
filename /hbase/filmContributors/1670715971/content/3783592739034234831
    at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:61)
    at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:79)
    at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:164)
    at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:106)
    at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.nextInternal(HRegion.java:1807)
    at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1771)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1894)
    ... 5 more
Caused by: java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at
org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
    at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
    at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
    at
org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1163)
    at
org.apache.hadoop.hbase.io.HalfHFileReader$1.next(HalfHFileReader.java:125)
    at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:58)
    ... 11 more


Grepping the region server log for the directory name 1670715971 shows this:
2010-02-03 15:32:37,082 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/content/3783592739034234831.1670715971,
isReference=true, sequence id=7541774, length=33390929,
majorCompaction=false
2010-02-03 15:32:37,088 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/content/6518523095287027530.1670715971,
isReference=true, sequence id=7542003, length=7890, majorCompaction=false
2010-02-03 15:32:37,095 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/description/2305635563712489918.1670715971,
isReference=true, sequence id=7542003, length=2256, majorCompaction=false
2010-02-03 15:32:37,101 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/description/6970032752270852156.1670715971,
isReference=true, sequence id=7541774, length=6664268, majorCompaction=false
2010-02-03 15:32:37,129 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/content/3783592739034234831.1670715971,
isReference=true, sequence id=7541773, length=33390929,
majorCompaction=false
2010-02-03 15:32:37,152 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/content/6518523095287027530.1670715971,
isReference=true, sequence id=7542002, length=7890, majorCompaction=false
2010-02-03 15:32:37,165 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/description/2305635563712489918.1670715971,
isReference=true, sequence id=7542002, length=2256, majorCompaction=false
2010-02-03 15:32:37,170 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/description/6970032752270852156.1670715971,
isReference=true, sequence id=7541773, length=6664268, majorCompaction=false
2010-02-03 15:33:49,943 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
and, many many times, java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831

On a different one I found this:

2010-02-03 15:32:35,512 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Cleaned up /hbase/filmContributors/1670715971/splits true
2010-02-03 15:32:35,515 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'filmContributor
s,,1265203126633', STARTKEY => '', ENDKEY => '31587', ENCODED => 1670715971,
OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filmContributors',
MAX_FILESIZE => '268435456', FAMILIES => [{NAME =
> 'content', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME =>
'description', COMPRESSION => 'NONE', VERSIONS
=> '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}}, new regions: filmContributors,,1265207555247,
filmContributors,117416,1265207555247. Split took 0sec
more details here - http://pastebin.com/d7c52f27a

Also, in the namenode logs I can sometimes see messages like this:

2010-02-03 15:32:38,416 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 54310, call
complete(/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051,
DFSClient_-902184734)
from 10.0.100.51:49692: error: java.io.IOException: Could not complete write
to file
/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051 by
DFSClient_-902184734
java.io.IOException: Could not complete write to file
/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051 by
DFSClient_-902184734


Please help.

Cheers,
Michal

Re: Hbase fails at moderate load.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Glad to hear it's working for you!

J-D

2010/2/2 Michał Podsiadłowski <po...@gmail.com>:
> I swear I've seen there "in megabytes" or maybe it was somewhere else..  -
> of course after setting the correct value all seems to work like a charm
> Again RTFM style problem, great thanks for help.
>
> Cheers,
> Michal
>

Re: Hbase fails at moderate load.

Posted by Michał Podsiadłowski <po...@gmail.com>.
I swear I've seen "in megabytes" there, or maybe it was somewhere else... Of
course, after setting the correct value everything seems to work like a charm.
Another RTFM-style problem - great thanks for the help.

Cheers,
Michal

Re: Hbase fails at moderate load.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
>  But if I get 130 tables out of 64 megs how many i will get after splinting
> 1gig? Can you tell me what triggers further splits.

Correction, you set it to 64 bytes, not megabytes. That's 1024*1024
times smaller! The splits happen when one of the families is bigger
than MAX_FILESIZE.
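
(To put a number on it: 64MB would be 64 * 1024 * 1024 = 67,108,864 bytes, so
with MAX_FILESIZE => '64' any family over 64 bytes is already past the split
threshold.)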

> First one is triggered by exceeding "split at" size what is expect but after this i should get 2 x
> 32 megs. And then after one of them will grow up again above max filesize
> limit it should be splited. Am i right?

Well, we don't split exactly in the middle; we can't do that since we
could separate a row in two. We make a best effort to split in the middle
of the row keys.

J-D

>
> 2010/2/1 Jean-Daniel Cryans <jd...@apache.org>
>
>> From what you pasted:
>>
>> 2010-02-01 14:05:49,445 INFO
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
>> META updated, and report to master all successful. Old region=REGION
>> => {NAME => 'oldWebSingleRowCacheStore,,1265029544146', STARTKEY =>
>> '', ENDKEY => 'filmMenuEditions-not_selected\xC2\xAC1405', ENCODED =>
>> 1899385768, OFFLINE => true, SPLIT => true, TABLE => {{NAME =>
>> 'oldWebSingleRowCacheStore', MAX_FILESIZE => '64', FAMILIES => [{NAME
>> => 'content', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>> => 'true'}, {NAME => 'description', COMPRESSION => 'NONE', VERSIONS =>
>> '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>> BLOCKCACHE => 'true'}]}}, new regions:
>> oldWebSingleRowCacheStore,,1265029549167,
>> oldWebSingleRowCacheStore,filmLastTopics\xC2\xAC1155,1265029549167.
>> Split took 0sec
>>
>> I see MAX_FILESIZE => '64' which means you have set that table to
>> split after 64 _bytes_ so either use the default value of 256MB
>> (256*1024*1024) or even higher if you wish (I set usually set them to
>> 1GB).
>>
>>
>

Re: Hbase fails at moderate load.

Posted by Michał Podsiadłowski <po...@gmail.com>.
But if I get 130 regions out of 64 megs, how many will I get after splitting
1 gig? Can you tell me what triggers further splits? The first one is triggered
by exceeding the "split at" size, which is expected, but after that I should get 2 x
32 megs. And then, once one of them grows again above the max filesize
limit, it should be split. Am I right?

2010/2/1 Jean-Daniel Cryans <jd...@apache.org>

> From what you pasted:
>
> 2010-02-01 14:05:49,445 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
> META updated, and report to master all successful. Old region=REGION
> => {NAME => 'oldWebSingleRowCacheStore,,1265029544146', STARTKEY =>
> '', ENDKEY => 'filmMenuEditions-not_selected\xC2\xAC1405', ENCODED =>
> 1899385768, OFFLINE => true, SPLIT => true, TABLE => {{NAME =>
> 'oldWebSingleRowCacheStore', MAX_FILESIZE => '64', FAMILIES => [{NAME
> => 'content', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
> => 'true'}, {NAME => 'description', COMPRESSION => 'NONE', VERSIONS =>
> '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> BLOCKCACHE => 'true'}]}}, new regions:
> oldWebSingleRowCacheStore,,1265029549167,
> oldWebSingleRowCacheStore,filmLastTopics\xC2\xAC1155,1265029549167.
> Split took 0sec
>
> I see MAX_FILESIZE => '64' which means you have set that table to
> split after 64 _bytes_ so either use the default value of 256MB
> (256*1024*1024) or even higher if you wish (I set usually set them to
> 1GB).
>
>

Re: Hbase fails at moderate load.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
From what you pasted:

2010-02-01 14:05:49,445 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
META updated, and report to master all successful. Old region=REGION
=> {NAME => 'oldWebSingleRowCacheStore,,1265029544146', STARTKEY =>
'', ENDKEY => 'filmMenuEditions-not_selected\xC2\xAC1405', ENCODED =>
1899385768, OFFLINE => true, SPLIT => true, TABLE => {{NAME =>
'oldWebSingleRowCacheStore', MAX_FILESIZE => '64', FAMILIES => [{NAME
=> 'content', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}, {NAME => 'description', COMPRESSION => 'NONE', VERSIONS =>
'3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}}, new regions:
oldWebSingleRowCacheStore,,1265029549167,
oldWebSingleRowCacheStore,filmLastTopics\xC2\xAC1155,1265029549167.
Split took 0sec

I see MAX_FILESIZE => '64', which means you have set that table to
split after 64 _bytes_, so either use the default value of 256MB
(256*1024*1024) or even higher if you wish (I usually set them to
1GB).
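
If you set up the table from the Java client API rather than the shell, the
max file size goes on the table descriptor. A rough, untested sketch (family
names copied from your log):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateCacheTable {
  public static void main(String[] args) throws Exception {
    HTableDescriptor desc = new HTableDescriptor("oldWebSingleRowCacheStore");
    desc.addFamily(new HColumnDescriptor("content"));
    desc.addFamily(new HColumnDescriptor("description"));
    // MAX_FILESIZE is in bytes, so 256MB = 256 * 1024 * 1024
    desc.setMaxFileSize(256L * 1024 * 1024);
    new HBaseAdmin(new HBaseConfiguration()).createTable(desc);
  }
}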

J-D

2010/2/1 Michał Podsiadłowski <po...@gmail.com>:
> Hi Stack,
> thanks for your last input.
>
> I've started new week with few tweaks of environment. I've put down one of
> the web servers so i gained additional node.
> I've put there HMaster, both namenodes and zookeeper and requested from our
> IT stuff some additional memory to rest of nodes.
>
> Now setup is like this:
> Namenode + Secondary Namenode + HMaster @ 1GB + zookeeper @256MB - machine
> with 4gb
> 3 x datanodes/hregions   - DataNode @768Mb  + HRegion @1GB  - machines  2GB
> of ram
> 2 additional zookeepers @256MB on webservers that are uploading to hbase.
>
> Probably more memory for OS cache/buffors on datanodes would be useful but
> free -m after quite long upload says:
> *             total       used       free     shared    buffers     cached
> Mem:          2048        903       1144          0         37        362
> -/+ buffers/cache:        503       1544
> Swap:         1019          0       1019
>
> *All is based on hadoop 0.20.2 and hbase 0.20.3.
>
>
> All seems to be much more stable.
> Too many open files is no longer a problem (max file size - 16mb was wrong
> idea).
> But still problem with dividing very first region occured.
> For around 1 minute regions were dividing and dividing till they reach total
> count around 130.
> During that time in .META. some regions were not assigned to servers  ( exp.
> no address for region in .META.).
> But I think i haven't seen problems with hitting wrong regions or not
> serving regions.
> This is something that really freaks us out, because potential this can
> happen every region split
> and then whole application can go bananas.
> Can someone explain why regions are dividing so rapidly and to such a
> quantity?
>
> http://pastebin.com/m73276a36 - here you can find a piece of log from that
> moment
>
>
> Cheers,
> Michal
>
> *
>
> *
>
> 2010/1/31 Stack <st...@duboce.net>
>
>> What Tim said and then some comments in the below.
>>
>> What version of hbase?
>>
>>
>> >
>> > This happens every time when first region starts to split. As far as i
>> can
>> > see table is set to enabled *false* (web admin), web admin becomes little
>> > bit less responsible - listing table regions shows no regions.
>> > and after a while i can see 500 or more regions.
>>
>> You go from zero to 500 regions with nothing showing in between?
>> Thats pretty impressive.  500 regions in 256M on 3 servers is probably
>> pushing it
>>
>>  Some of them as exception
>> > shows are not fully available.
>>
>> Identify the duff regions by running a full table scan in the shell
>> with DEBUG enabled on the client.  It'll puke when it hits the first
>> broke region
>>
>> HDFS doesn't seems to be the main issue. When
>> > i run fsck it says hbase dir is healthy apart from some under replicated
>> > blocks. Occasionaly i saw that some blocks where missing but i think this
>> > was due to "Too many files open" exceptions (to small regions size - now
>> > it's default 64)
>>
>> Too many open files is bad.  Check out the hbase 'Getting Started'.
>>
>>
>> > Amount of data is not enormous - around 1gb in less then 100k rows then
>> this
>> > problems starts to occur. Request per seconds is i think small - 20-30
>> per
>> > second.
>> > What else i can say is I've set the max hbase retry to only 2 because we
>> > can't allow clients to wait more for response.
>> >
>>
>> I would suggest you leave things at default till running smooth then
>> start in optimizing.
>>
>>
>> > What i would like to know is whether the table is always disabled when
>> > performing region splits?
>>
>> No.  Region goes offline for some period of time.  If machines are
>> heavily loaded it will take longer for it to come back on line again.
>>
>> And is it truly disabled then so that clients
>> > can't do anything?
>> > It looks like status says disabled but still requests are processed,
>> though,
>> > with different results (some like above).
>> >
>>
>> Disabled or 'offline'?   Parents of region splits go offline and are
>> replaced by new daughter splits.
>>
>> >
>> >
>> > My cluster setup can be probably useful -
>> > 3 centos virtual machines based on xen running DN/HR and zookeeper  + one
>> of
>> > them NodeMaster and Secondary Master.
>> > 2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and
>> hbase
>> > with 256 but non of them is swapping nor going out of memory.
>> > GC logs looks normal - stop the world is not occurring ;)
>>
>>
>> Really?  No full GCs though only 256 and though about 100 plus regions
>> per server?
>>
>> > top says cpus are nearly idle on all machines.
>> >
>> > It's far from ideal but we need to prove that this can work reliably to
>> get
>> > more toys.
>> > Maybe next week we will be able to test on some better machines but for
>> now
>> > that all what I've got.
>> >
>> Makes sense.  You are starting very small though and virtual machines
>> have proven a flakey foundation for hbase.  Read back over the list
>> and look for ec2 mentions.
>>
>> St.Ack
>>
>> >
>> > Any advices are welcome.
>> >
>> >
>> > Thanks,
>> > Michal
>> >
>>
>

Re: Hbase fails at moderate load.

Posted by Michał Podsiadłowski <po...@gmail.com>.
Hi Stack,
thanks for your last input.

I've started the new week with a few tweaks to the environment. I've taken down one of
the web servers, so I gained an additional node.
I've put the HMaster, both namenodes and a zookeeper there, and requested some
additional memory for the rest of the nodes from our IT staff.

Now the setup is like this:
NameNode + Secondary NameNode + HMaster @ 1GB + ZooKeeper @ 256MB - machine
with 4GB
3 x DataNode/HRegionServer - DataNode @ 768MB + HRegionServer @ 1GB - machines with 2GB
of RAM
2 additional ZooKeepers @ 256MB on the webservers that are uploading to HBase.

Probably more memory for the OS cache/buffers on the datanodes would be useful, but
free -m after quite a long upload says:
             total       used       free     shared    buffers     cached
Mem:          2048        903       1144          0         37        362
-/+ buffers/cache:        503       1544
Swap:         1019          0       1019

All is based on Hadoop 0.20.2 and HBase 0.20.3.


Everything seems to be much more stable.
Too many open files is no longer a problem (a max file size of 16MB was a bad
idea).
But the problem with splitting the very first region still occurred.
For around 1 minute regions kept splitting and splitting until they reached a total
count of around 130.
During that time some regions in .META. were not assigned to servers (e.g.
no address for region in .META.).
But I don't think I've seen problems with hitting wrong regions or regions not
being served.
This is something that really freaks us out, because potentially this can
happen on every region split,
and then the whole application can go bananas.
Can someone explain why regions are splitting so rapidly and to such a
quantity?

http://pastebin.com/m73276a36 - here you can find a piece of log from that
moment


Cheers,
Michal


2010/1/31 Stack <st...@duboce.net>

> What Tim said and then some comments in the below.
>
> What version of hbase?
>
>
> >
> > This happens every time when first region starts to split. As far as i
> can
> > see table is set to enabled *false* (web admin), web admin becomes little
> > bit less responsible - listing table regions shows no regions.
> > and after a while i can see 500 or more regions.
>
> You go from zero to 500 regions with nothing showing in between?
> Thats pretty impressive.  500 regions in 256M on 3 servers is probably
> pushing it
>
>  Some of them as exception
> > shows are not fully available.
>
> Identify the duff regions by running a full table scan in the shell
> with DEBUG enabled on the client.  It'll puke when it hits the first
> broke region
>
> HDFS doesn't seems to be the main issue. When
> > i run fsck it says hbase dir is healthy apart from some under replicated
> > blocks. Occasionaly i saw that some blocks where missing but i think this
> > was due to "Too many files open" exceptions (to small regions size - now
> > it's default 64)
>
> Too many open files is bad.  Check out the hbase 'Getting Started'.
>
>
> > Amount of data is not enormous - around 1gb in less then 100k rows then
> this
> > problems starts to occur. Request per seconds is i think small - 20-30
> per
> > second.
> > What else i can say is I've set the max hbase retry to only 2 because we
> > can't allow clients to wait more for response.
> >
>
> I would suggest you leave things at default till running smooth then
> start in optimizing.
>
>
> > What i would like to know is whether the table is always disabled when
> > performing region splits?
>
> No.  Region goes offline for some period of time.  If machines are
> heavily loaded it will take longer for it to come back on line again.
>
> And is it truly disabled then so that clients
> > can't do anything?
> > It looks like status says disabled but still requests are processed,
> though,
> > with different results (some like above).
> >
>
> Disabled or 'offline'?   Parents of region splits go offline and are
> replaced by new daughter splits.
>
> >
> >
> > My cluster setup can be probably useful -
> > 3 centos virtual machines based on xen running DN/HR and zookeeper  + one
> of
> > them NodeMaster and Secondary Master.
> > 2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and
> hbase
> > with 256 but non of them is swapping nor going out of memory.
> > GC logs looks normal - stop the world is not occurring ;)
>
>
> Really?  No full GCs though only 256 and though about 100 plus regions
> per server?
>
> > top says cpus are nearly idle on all machines.
> >
> > It's far from ideal but we need to prove that this can work reliably to
> get
> > more toys.
> > Maybe next week we will be able to test on some better machines but for
> now
> > that all what I've got.
> >
> Makes sense.  You are starting very small though and virtual machines
> have proven a flakey foundation for hbase.  Read back over the list
> and look for ec2 mentions.
>
> St.Ack
>
> >
> > Any advices are welcome.
> >
> >
> > Thanks,
> > Michal
> >
>

Re: Hbase fails at moderate load.

Posted by Stack <st...@duboce.net>.
What Tim said, and then some comments below.

What version of hbase?


>
> This happens every time when first region starts to split. As far as i can
> see table is set to enabled *false* (web admin), web admin becomes little
> bit less responsible - listing table regions shows no regions.
> and after a while i can see 500 or more regions.

You go from zero to 500 regions with nothing showing in between?
That's pretty impressive.  500 regions in 256M on 3 servers is probably
pushing it.

> Some of them as exception
> shows are not fully available.

Identify the duff regions by running a full table scan in the shell
with DEBUG enabled on the client.  It'll puke when it hits the first
broken region.
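
If the shell is awkward, the same thing from a small java client works too;
something along these lines (an untested sketch - pass your table name as the
argument):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FindBrokenRegion {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), args[0]);
    ResultScanner scanner = table.getScanner(new Scan());
    long rows = 0;
    try {
      for (Result row : scanner) {
        rows++;
        if (rows % 10000 == 0) {
          // the last key printed before the scan blows up points at the broken region
          System.out.println(rows + " rows so far, last key: " + Bytes.toString(row.getRow()));
        }
      }
      System.out.println("Scanned " + rows + " rows without error.");
    } finally {
      scanner.close();
    }
  }
}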

> HDFS doesn't seems to be the main issue. When
> i run fsck it says hbase dir is healthy apart from some under replicated
> blocks. Occasionaly i saw that some blocks where missing but i think this
> was due to "Too many files open" exceptions (to small regions size - now
> it's default 64)

Too many open files is bad.  Check out the hbase 'Getting Started'.


> Amount of data is not enormous - around 1gb in less then 100k rows then this
> problems starts to occur. Request per seconds is i think small - 20-30 per
> second.
> What else i can say is I've set the max hbase retry to only 2 because we
> can't allow clients to wait more for response.
>

I would suggest you leave things at the defaults till running smoothly, then
start optimizing.


> What i would like to know is whether the table is always disabled when
> performing region splits?

No.  Region goes offline for some period of time.  If machines are
heavily loaded it will take longer for it to come back on line again.

> And is it truly disabled then so that clients
> can't do anything?
> It looks like status says disabled but still requests are processed, though,
> with different results (some like above).
>

Disabled or 'offline'?   Parents of region splits go offline and are
replaced by new daughter splits.

>
>
> My cluster setup can be probably useful -
> 3 centos virtual machines based on xen running DN/HR and zookeeper  + one of
> them NodeMaster and Secondary Master.
> 2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and hbase
> with 256 but non of them is swapping nor going out of memory.
> GC logs looks normal - stop the world is not occurring ;)


Really?  No full GCs even though the heap is only 256M and there are 100-plus regions
per server?

> top says cpus are nearly idle on all machines.
>
> It's far from ideal but we need to prove that this can work reliably to get
> more toys.
> Maybe next week we will be able to test on some better machines but for now
> that all what I've got.
>
Makes sense.  You are starting very small though, and virtual machines
have proven a flaky foundation for HBase.  Read back over the list
and look for EC2 mentions.

St.Ack

>
> Any advices are welcome.
>
>
> Thanks,
> Michal
>

Re: Hbase fails at moderate load.

Posted by Tim Robertson <ti...@gmail.com>.
Hi Michal,

[Disclaimer: I am not well experienced in HBase]

Those seem like very low memory allocations for HBase, from what I've
seen/observed on this list.  I was told not to consider less than 8G
for those daemons.  It could be that you need to increase all the lease
times to allow the split to happen.

Just an idea, but if you needed to, you could consider Amazon EC2 with
XLarge instances for a very small amount to prove the concept.

Cheers,
Tim




2010/1/29 Michał Podsiadłowski <po...@gmail.com>:
> Hi all!
>
> I'm in the middle of some performance and stability testing of out small
> hbase cluster to check if it suitable for out application.
> We want to use it as web cache persistence layer for out web app which
> handles quite large amount of traffic.
> Of course i have lot's of problems with it.
>
> Main one is that client applications (web servers) can persist of retrieve
> rows and fail miserably with exceptions like this:
>
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
> listed in .META. for region
> oldAppWebSingleRowCacheStore,filmMenuCuriosities-not_selected\xC2\xAC150,1264766907002
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
>
>  could not retrieve persisted cache id 'filmRanking' for key '3872'
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> region server 10.0.100.51:60020 for region
> oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498, row
> 'filmRanking\xC2\xAC3872', but failed after 2 attempts.
> Exceptions:
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException:
> oldAppWebSingleRowCacheStore,filmRanking�3746,1264766860498
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2266)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1730)
>
> This happens every time when first region starts to split. As far as i can
> see table is set to enabled *false* (web admin), web admin becomes little
> bit less responsible - listing table regions shows no regions.
> and after a while i can see 500 or more regions. Some of them as exception
> shows are not fully available. HDFS doesn't seems to be the main issue. When
> i run fsck it says hbase dir is healthy apart from some under replicated
> blocks. Occasionaly i saw that some blocks where missing but i think this
> was due to "Too many files open" exceptions (to small regions size - now
> it's default 64)
> Amount of data is not enormous - around 1gb in less then 100k rows then this
> problems starts to occur. Request per seconds is i think small - 20-30 per
> second.
> What else i can say is I've set the max hbase retry to only 2 because we
> can't allow clients to wait more for response.
>
> What i would like to know is whether the table is always disabled when
> performing region splits? And is it truly disabled then so that clients
> can't do anything?
> It looks like status says disabled but still requests are processed, though,
> with different results (some like above).
>
>
>
> My cluster setup can be probably useful -
> 3 centos virtual machines based on xen running DN/HR and zookeeper  + one of
> them NodeMaster and Secondary Master.
> 2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and hbase
> with 256 but non of them is swapping nor going out of memory.
> GC logs looks normal - stop the world is not occurring ;)
> top says cpus are nearly idle on all machines.
>
> It's far from ideal but we need to prove that this can work reliably to get
> more toys.
> Maybe next week we will be able to test on some better machines but for now
> that all what I've got.
>
>
> Any advices are welcome.
>
>
> Thanks,
> Michal
>