You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Seraph Imalia <se...@eisp.co.za> on 2011/12/28 15:27:25 UTC

RAM Problems - Keeps Crashing

Hi Guys,

After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues. I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 0.90.4 even 4.5 Gigs seems not enough. It does not matter how much load the hbase services are under, it just crashes after 24-48 hours. The only difference the load makes is how quickly the services crash. Even over this holiday season with our lowest load of the year, it crashes just after 36 hours of being started. To fix it, I have to run the stop-hbase.sh command, wait a while and kill -9 any hbase processes that have stopped outputting logs or stopped responding, and then run start-hbase.sh again.

Attached are my logs from the latest "start-to-crash". There are 3 servers and hbase is being used for storing URL's - 7 client servers connect to hbase and perform URL Lookups at about 40 requests per second (this is the low load over this holiday season). If the URL does not exist, it gets added. The Key on the HTable is the URL and there are a few fields stored against it - e.g. DateDiscovered, Host, Script, QueryString, etc.

Each server has a hadoop datanode and an hbase regionserver and 1 of the servers additionally has the namenode, master and zookeeper. On first start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the clients, the usedHeap slowly climes until it reaches the maxHeap and shortly after that, the regionservers start crashing - sometimes they actually shutdown gracefully by themselves.

Originally, we had hbase.regionserver.handler.count set to 100 and I have now removed that to leave it as default which has not helped.

We have not made any changes to the clients and we have a mirrored instance of this in our UK Data Centre which is still running 0.20.6 and servicing 10 clients currently at over 300 requests per second (again low load over the holidays) and it is 100% stable.

What do I do now? - your website says I cannot downgrade?

Please help

Regards,
Seraph

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

Hi Guys,

I am quite desperate to resolve the issue I am having, what is the next thing I can try to resolve this issue? - All I can think of is to downgrade back to 0.20.6 and write an am to migrate all the data :(

I hope everyone had a great new year :)

Regards,
Seraph


On 30 Dec 2011, at 7:41 PM, Seraph Imalia wrote:

> Setting that property to false has not made any difference, hbase has just crashed again (ran out of heap) and I am busy restarting it.  What do I do now?
> 
> 
> On 29 Dec 2011, at 5:56 PM, Seraph Imalia wrote:
> 
>> Thanks,
>> 
>> I will try disabling it to see if the memory is being taken up by MSLAB.
>> 
>> Regards,
>> Seraph
>> 
>> On 29 Dec 2011, at 5:47 PM, Ted Yu wrote:
>> 
>>> mslab was introduced after 0.20.6
>>> 
>>> Read Todd's series:
>>> http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/
>>> 
>>> Cheers
>>> 
>>> On Thu, Dec 29, 2011 at 12:19 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>>> 
>>>> Region Servers
>>>> 
>>>> Address                         Start Code              Load
>>>> dynobuntu10:60030       1325081250180   requests=43, regions=224,
>>>> usedHeap=3946, maxHeap=4087
>>>> dynobuntu12:60030       1325081249966   requests=32, regions=224,
>>>> usedHeap=3821, maxHeap=4087
>>>> dynobuntu17:60030       1325081248407   requests=39, regions=225,
>>>> usedHeap=4016, maxHeap=4087
>>>> Total:  servers: 3              requests=114, regions=673
>>>> 
>>>> I restarted them yesterday and the number of regions increased from 667 to
>>>> 673 and they are about to run out of heap again :(.  Should I set that
>>>> property to false? - what does mslab do? - is it new after 0.20.6?
>>>> 
>>>> Regards,
>>>> Seraph
>>>> 
>>>> On 28 Dec 2011, at 5:46 PM, Ted Yu wrote:
>>>> 
>>>>> Can you tell me how many regions each region server hosts ?
>>>>> 
>>>>> In 0.90.4 there is this parameter:
>>>>> <name>hbase.hregion.memstore.mslab.enabled</name>
>>>>> <value>true</value>
>>>>> mslab tends to consume heap if region count is high.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za>
>>>> wrote:
>>>>> 
>>>>>> Hi Guys,
>>>>>> 
>>>>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
>>>>>> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but
>>>> with
>>>>>> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
>>>>>> the hbase services are under, it just crashes after 24-48 hours.  The
>>>> only
>>>>>> difference the load makes is how quickly the services crash.  Even over
>>>>>> this holiday season with our lowest load of the year, it crashes just
>>>> after
>>>>>> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
>>>>>> command, wait a while and kill -9 any hbase processes that have stopped
>>>>>> outputting logs or stopped responding, and then run start-hbase.sh
>>>> again.
>>>>>> 
>>>>>> Attached are my logs from the latest "start-to-crash".  There are 3
>>>>>> servers and hbase is being used for storing URL's - 7 client servers
>>>>>> connect to hbase and perform URL Lookups at about 40 requests per second
>>>>>> (this is the low load over this holiday season).  If the URL does not
>>>>>> exist, it gets added.  The Key on the HTable is the URL and there are a
>>>> few
>>>>>> fields stored against it - e.g. DateDiscovered, Host, Script,
>>>> QueryString,
>>>>>> etc.
>>>>>> 
>>>>>> Each server has a hadoop datanode and an hbase regionserver and 1 of the
>>>>>> servers additionally has the namenode, master and zookeeper.  On first
>>>>>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
>>>>>> the clients, the usedHeap slowly climes until it reaches the maxHeap and
>>>>>> shortly after that, the regionservers start crashing - sometimes they
>>>>>> actually shutdown gracefully by themselves.
>>>>>> 
>>>>>> Originally, we had hbase.regionserver.handler.count set to 100 and I
>>>> have
>>>>>> now removed that to leave it as default which has not helped.
>>>>>> 
>>>>>> We have not made any changes to the clients and we have a mirrored
>>>>>> instance of this in our UK Data Centre which is still running 0.20.6 and
>>>>>> servicing 10 clients currently at over 300 requests per second (again
>>>> low
>>>>>> load over the holidays) and it is 100% stable.
>>>>>> 
>>>>>> What do I do now? - your website says I cannot downgrade?
>>>>>> 
>>>>>> Please help
>>>>>> 
>>>>>> Regards,
>>>>>> Seraph
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

Setting that property to false has not made any difference, hbase has just crashed again (ran out of heap) and I am busy restarting it.  What do I do now?


On 29 Dec 2011, at 5:56 PM, Seraph Imalia wrote:

> Thanks,
> 
> I will try disabling it to see if the memory is being taken up by MSLAB.
> 
> Regards,
> Seraph
> 
> On 29 Dec 2011, at 5:47 PM, Ted Yu wrote:
> 
>> mslab was introduced after 0.20.6
>> 
>> Read Todd's series:
>> http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/
>> 
>> Cheers
>> 
>> On Thu, Dec 29, 2011 at 12:19 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>> 
>>> Region Servers
>>> 
>>> Address                         Start Code              Load
>>> dynobuntu10:60030       1325081250180   requests=43, regions=224,
>>> usedHeap=3946, maxHeap=4087
>>> dynobuntu12:60030       1325081249966   requests=32, regions=224,
>>> usedHeap=3821, maxHeap=4087
>>> dynobuntu17:60030       1325081248407   requests=39, regions=225,
>>> usedHeap=4016, maxHeap=4087
>>> Total:  servers: 3              requests=114, regions=673
>>> 
>>> I restarted them yesterday and the number of regions increased from 667 to
>>> 673 and they are about to run out of heap again :(.  Should I set that
>>> property to false? - what does mslab do? - is it new after 0.20.6?
>>> 
>>> Regards,
>>> Seraph
>>> 
>>> On 28 Dec 2011, at 5:46 PM, Ted Yu wrote:
>>> 
>>>> Can you tell me how many regions each region server hosts ?
>>>> 
>>>> In 0.90.4 there is this parameter:
>>>> <name>hbase.hregion.memstore.mslab.enabled</name>
>>>> <value>true</value>
>>>> mslab tends to consume heap if region count is high.
>>>> 
>>>> Cheers
>>>> 
>>>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za>
>>> wrote:
>>>> 
>>>>> Hi Guys,
>>>>> 
>>>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
>>>>> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but
>>> with
>>>>> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
>>>>> the hbase services are under, it just crashes after 24-48 hours.  The
>>> only
>>>>> difference the load makes is how quickly the services crash.  Even over
>>>>> this holiday season with our lowest load of the year, it crashes just
>>> after
>>>>> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
>>>>> command, wait a while and kill -9 any hbase processes that have stopped
>>>>> outputting logs or stopped responding, and then run start-hbase.sh
>>> again.
>>>>> 
>>>>> Attached are my logs from the latest "start-to-crash".  There are 3
>>>>> servers and hbase is being used for storing URL's - 7 client servers
>>>>> connect to hbase and perform URL Lookups at about 40 requests per second
>>>>> (this is the low load over this holiday season).  If the URL does not
>>>>> exist, it gets added.  The Key on the HTable is the URL and there are a
>>> few
>>>>> fields stored against it - e.g. DateDiscovered, Host, Script,
>>> QueryString,
>>>>> etc.
>>>>> 
>>>>> Each server has a hadoop datanode and an hbase regionserver and 1 of the
>>>>> servers additionally has the namenode, master and zookeeper.  On first
>>>>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
>>>>> the clients, the usedHeap slowly climes until it reaches the maxHeap and
>>>>> shortly after that, the regionservers start crashing - sometimes they
>>>>> actually shutdown gracefully by themselves.
>>>>> 
>>>>> Originally, we had hbase.regionserver.handler.count set to 100 and I
>>> have
>>>>> now removed that to leave it as default which has not helped.
>>>>> 
>>>>> We have not made any changes to the clients and we have a mirrored
>>>>> instance of this in our UK Data Centre which is still running 0.20.6 and
>>>>> servicing 10 clients currently at over 300 requests per second (again
>>> low
>>>>> load over the holidays) and it is 100% stable.
>>>>> 
>>>>> What do I do now? - your website says I cannot downgrade?
>>>>> 
>>>>> Please help
>>>>> 
>>>>> Regards,
>>>>> Seraph
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

Thanks,

I will try disabling it to see if the memory is being taken up by MSLAB.

Regards,
Seraph

On 29 Dec 2011, at 5:47 PM, Ted Yu wrote:

> mslab was introduced after 0.20.6
> 
> Read Todd's series:
> http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/
> 
> Cheers
> 
> On Thu, Dec 29, 2011 at 12:19 AM, Seraph Imalia <se...@eisp.co.za> wrote:
> 
>> Region Servers
>> 
>> Address                         Start Code              Load
>> dynobuntu10:60030       1325081250180   requests=43, regions=224,
>> usedHeap=3946, maxHeap=4087
>> dynobuntu12:60030       1325081249966   requests=32, regions=224,
>> usedHeap=3821, maxHeap=4087
>> dynobuntu17:60030       1325081248407   requests=39, regions=225,
>> usedHeap=4016, maxHeap=4087
>> Total:  servers: 3              requests=114, regions=673
>> 
>> I restarted them yesterday and the number of regions increased from 667 to
>> 673 and they are about to run out of heap again :(.  Should I set that
>> property to false? - what does mslab do? - is it new after 0.20.6?
>> 
>> Regards,
>> Seraph
>> 
>> On 28 Dec 2011, at 5:46 PM, Ted Yu wrote:
>> 
>>> Can you tell me how many regions each region server hosts ?
>>> 
>>> In 0.90.4 there is this parameter:
>>> <name>hbase.hregion.memstore.mslab.enabled</name>
>>> <value>true</value>
>>> mslab tends to consume heap if region count is high.
>>> 
>>> Cheers
>>> 
>>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za>
>> wrote:
>>> 
>>>> Hi Guys,
>>>> 
>>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
>>>> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but
>> with
>>>> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
>>>> the hbase services are under, it just crashes after 24-48 hours.  The
>> only
>>>> difference the load makes is how quickly the services crash.  Even over
>>>> this holiday season with our lowest load of the year, it crashes just
>> after
>>>> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
>>>> command, wait a while and kill -9 any hbase processes that have stopped
>>>> outputting logs or stopped responding, and then run start-hbase.sh
>> again.
>>>> 
>>>> Attached are my logs from the latest "start-to-crash".  There are 3
>>>> servers and hbase is being used for storing URL's - 7 client servers
>>>> connect to hbase and perform URL Lookups at about 40 requests per second
>>>> (this is the low load over this holiday season).  If the URL does not
>>>> exist, it gets added.  The Key on the HTable is the URL and there are a
>> few
>>>> fields stored against it - e.g. DateDiscovered, Host, Script,
>> QueryString,
>>>> etc.
>>>> 
>>>> Each server has a hadoop datanode and an hbase regionserver and 1 of the
>>>> servers additionally has the namenode, master and zookeeper.  On first
>>>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
>>>> the clients, the usedHeap slowly climes until it reaches the maxHeap and
>>>> shortly after that, the regionservers start crashing - sometimes they
>>>> actually shutdown gracefully by themselves.
>>>> 
>>>> Originally, we had hbase.regionserver.handler.count set to 100 and I
>> have
>>>> now removed that to leave it as default which has not helped.
>>>> 
>>>> We have not made any changes to the clients and we have a mirrored
>>>> instance of this in our UK Data Centre which is still running 0.20.6 and
>>>> servicing 10 clients currently at over 300 requests per second (again
>> low
>>>> load over the holidays) and it is 100% stable.
>>>> 
>>>> What do I do now? - your website says I cannot downgrade?
>>>> 
>>>> Please help
>>>> 
>>>> Regards,
>>>> Seraph
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>>

Re: RAM Problems - Keeps Crashing

Posted by Ted Yu <yu...@gmail.com>.

mslab was introduced after 0.20.6

Read Todd's series:
http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/

Cheers

On Thu, Dec 29, 2011 at 12:19 AM, Seraph Imalia <se...@eisp.co.za> wrote:

> Region Servers
>
> Address                         Start Code              Load
> dynobuntu10:60030       1325081250180   requests=43, regions=224,
> usedHeap=3946, maxHeap=4087
> dynobuntu12:60030       1325081249966   requests=32, regions=224,
> usedHeap=3821, maxHeap=4087
> dynobuntu17:60030       1325081248407   requests=39, regions=225,
> usedHeap=4016, maxHeap=4087
> Total:  servers: 3              requests=114, regions=673
>
> I restarted them yesterday and the number of regions increased from 667 to
> 673 and they are about to run out of heap again :(.  Should I set that
> property to false? - what does mslab do? - is it new after 0.20.6?
>
> Regards,
> Seraph
>
> On 28 Dec 2011, at 5:46 PM, Ted Yu wrote:
>
> > Can you tell me how many regions each region server hosts ?
> >
> > In 0.90.4 there is this parameter:
> >    <name>hbase.hregion.memstore.mslab.enabled</name>
> >    <value>true</value>
> > mslab tends to consume heap if region count is high.
> >
> > Cheers
> >
> > On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za>
> wrote:
> >
> >> Hi Guys,
> >>
> >> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
> >> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but
> with
> >> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
> >> the hbase services are under, it just crashes after 24-48 hours.  The
> only
> >> difference the load makes is how quickly the services crash.  Even over
> >> this holiday season with our lowest load of the year, it crashes just
> after
> >> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
> >> command, wait a while and kill -9 any hbase processes that have stopped
> >> outputting logs or stopped responding, and then run start-hbase.sh
> again.
> >>
> >> Attached are my logs from the latest "start-to-crash".  There are 3
> >> servers and hbase is being used for storing URL's - 7 client servers
> >> connect to hbase and perform URL Lookups at about 40 requests per second
> >> (this is the low load over this holiday season).  If the URL does not
> >> exist, it gets added.  The Key on the HTable is the URL and there are a
> few
> >> fields stored against it - e.g. DateDiscovered, Host, Script,
> QueryString,
> >> etc.
> >>
> >> Each server has a hadoop datanode and an hbase regionserver and 1 of the
> >> servers additionally has the namenode, master and zookeeper.  On first
> >> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
> >> the clients, the usedHeap slowly climes until it reaches the maxHeap and
> >> shortly after that, the regionservers start crashing - sometimes they
> >> actually shutdown gracefully by themselves.
> >>
> >> Originally, we had hbase.regionserver.handler.count set to 100 and I
> have
> >> now removed that to leave it as default which has not helped.
> >>
> >> We have not made any changes to the clients and we have a mirrored
> >> instance of this in our UK Data Centre which is still running 0.20.6 and
> >> servicing 10 clients currently at over 300 requests per second (again
> low
> >> load over the holidays) and it is 100% stable.
> >>
> >> What do I do now? - your website says I cannot downgrade?
> >>
> >> Please help
> >>
> >> Regards,
> >> Seraph
> >>
> >>
> >>
> >>
> >>
> >>
>
>

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

Region Servers

Address	                        Start Code	        Load
dynobuntu10:60030	1325081250180	requests=43, regions=224, usedHeap=3946, maxHeap=4087
dynobuntu12:60030	1325081249966	requests=32, regions=224, usedHeap=3821, maxHeap=4087
dynobuntu17:60030	1325081248407	requests=39, regions=225, usedHeap=4016, maxHeap=4087
Total:	servers: 3	 	requests=114, regions=673

I restarted them yesterday and the number of regions increased from 667 to 673 and they are about to run out of heap again :(.  Should I set that property to false? - what does mslab do? - is it new after 0.20.6?

Regards,
Seraph

On 28 Dec 2011, at 5:46 PM, Ted Yu wrote:

> Can you tell me how many regions each region server hosts ?
> 
> In 0.90.4 there is this parameter:
>    <name>hbase.hregion.memstore.mslab.enabled</name>
>    <value>true</value>
> mslab tends to consume heap if region count is high.
> 
> Cheers
> 
> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:
> 
>> Hi Guys,
>> 
>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
>> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with
>> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
>> the hbase services are under, it just crashes after 24-48 hours.  The only
>> difference the load makes is how quickly the services crash.  Even over
>> this holiday season with our lowest load of the year, it crashes just after
>> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
>> command, wait a while and kill -9 any hbase processes that have stopped
>> outputting logs or stopped responding, and then run start-hbase.sh again.
>> 
>> Attached are my logs from the latest "start-to-crash".  There are 3
>> servers and hbase is being used for storing URL's - 7 client servers
>> connect to hbase and perform URL Lookups at about 40 requests per second
>> (this is the low load over this holiday season).  If the URL does not
>> exist, it gets added.  The Key on the HTable is the URL and there are a few
>> fields stored against it - e.g. DateDiscovered, Host, Script, QueryString,
>> etc.
>> 
>> Each server has a hadoop datanode and an hbase regionserver and 1 of the
>> servers additionally has the namenode, master and zookeeper.  On first
>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
>> the clients, the usedHeap slowly climes until it reaches the maxHeap and
>> shortly after that, the regionservers start crashing - sometimes they
>> actually shutdown gracefully by themselves.
>> 
>> Originally, we had hbase.regionserver.handler.count set to 100 and I have
>> now removed that to leave it as default which has not helped.
>> 
>> We have not made any changes to the clients and we have a mirrored
>> instance of this in our UK Data Centre which is still running 0.20.6 and
>> servicing 10 clients currently at over 300 requests per second (again low
>> load over the holidays) and it is 100% stable.
>> 
>> What do I do now? - your website says I cannot downgrade?
>> 
>> Please help
>> 
>> Regards,
>> Seraph
>> 
>> 
>> 
>> 
>> 
>>

Re: RAM Problems - Keeps Crashing

Posted by Ted Yu <yu...@gmail.com>.

Can you tell me how many regions each region server hosts ?

In 0.90.4 there is this parameter:
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
mslab tends to consume heap if region count is high.

Cheers

On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:

> Hi Guys,
>
> After updating from 0.20.6 to 0.90.4, we have been having serious RAM
> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with
> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load
> the hbase services are under, it just crashes after 24-48 hours.  The only
> difference the load makes is how quickly the services crash.  Even over
> this holiday season with our lowest load of the year, it crashes just after
> 36 hours of being started.  To fix it, I have to run the stop-hbase.sh
> command, wait a while and kill -9 any hbase processes that have stopped
> outputting logs or stopped responding, and then run start-hbase.sh again.
>
> Attached are my logs from the latest "start-to-crash".  There are 3
> servers and hbase is being used for storing URL's - 7 client servers
> connect to hbase and perform URL Lookups at about 40 requests per second
> (this is the low load over this holiday season).  If the URL does not
> exist, it gets added.  The Key on the HTable is the URL and there are a few
> fields stored against it - e.g. DateDiscovered, Host, Script, QueryString,
> etc.
>
> Each server has a hadoop datanode and an hbase regionserver and 1 of the
> servers additionally has the namenode, master and zookeeper.  On first
> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart
> the clients, the usedHeap slowly climes until it reaches the maxHeap and
> shortly after that, the regionservers start crashing - sometimes they
> actually shutdown gracefully by themselves.
>
> Originally, we had hbase.regionserver.handler.count set to 100 and I have
> now removed that to leave it as default which has not helped.
>
> We have not made any changes to the clients and we have a mirrored
> instance of this in our UK Data Centre which is still running 0.20.6 and
> servicing 10 clients currently at over 300 requests per second (again low
> load over the holidays) and it is 100% stable.
>
> What do I do now? - your website says I cannot downgrade?
>
> Please help
>
> Regards,
> Seraph
>
>
>
>
>
>

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

Hi Stack,

Thank you for that advice, I followed instructions exactly as specified in the link you sent so the files/nproc issue should be resolved now.

We will be changing the design of the table to shorten the key in the next few weeks we are just waiting for resources to make the changes.  Disabling the block cache did stop hbase from crashing every 24 hours and has been working for the past week.  Yesterday however, client lookup performance started dropping and I ended up restarting the hbase services 4 times - each time the performance was great and after about 1 or 2 hours the performance was slow again.  The Heap was fine with close to 1 Gig free Heap space per regionserver (and CPU/HDD usage was also normal so it did not seem to be a resource issue)

The only thing I could find that was unusual in the log files was on dynobuntu10 which kept showing the following warning every few minutes: IPC Server handler 'n' on 60020 caught: java.nio.channels.ClosedChannelException.   Additionally the other two regionservers were writing to the logs quite regularly but dynobuntu10 was not.

Re-enabling the block cache resolved the problem, but now I suspect my original problem will re-surface where I have to restart it every 24 hours.  It is also important to note that the above message is no longer logged when the block cache is re-enabled.

My long-winded question is therefore: Is the problem described above also related to the key being too long or have we stumbled upon another issue that needs to be addressed? - If it is related, then my goal right now is to get hbase relatively stable until we have the resources to change the table structure and any advice on keeping it stable for as long as possible would be appreciated (either an alternative to disabling the block cache - or by trying to address/prevent the above performance issue).

Here are the current Metrics on each regionserver - it is currently stable with no performance issues:

dynobuntu10: requests=62, regions=226, stores=226, storefiles=325, storefileIndexSize=1960, memstoreSize=972, compactionQueueSize=0, flushQueueSize=0, usedHeap=3537, maxHeap=4087, blockCacheSize=723034008, blockCacheFree=134216232, blockCacheCount=10914, blockCacheHitCount=2416681, blockCacheMissCount=3228779, blockCacheEvictedCount=250206, blockCacheHitRatio=42, blockCacheHitCachingRatio=90

dynobuntu12: requests=51, regions=226, stores=226, storefiles=311, storefileIndexSize=2229, memstoreSize=1001, compactionQueueSize=0, flushQueueSize=0, usedHeap=4008, maxHeap=4087, blockCacheSize=687937520, blockCacheFree=169312720, blockCacheCount=10384, blockCacheHitCount=2688299, blockCacheMissCount=3341337, blockCacheEvictedCount=211072, blockCacheHitRatio=44, blockCacheHitCachingRatio=92

dynobuntu17:  requests=60, regions=226, stores=226, storefiles=320, storefileIndexSize=2116, memstoreSize=1117, compactionQueueSize=0, flushQueueSize=0, usedHeap=3950, maxHeap=4087, blockCacheSize=657958152, blockCacheFree=199292088, blockCacheCount=10519, blockCacheHitCount=4083202, blockCacheMissCount=3426140, blockCacheEvictedCount=337502, blockCacheHitRatio=54, blockCacheHitCachingRatio=92

Regards,
Seraph

On 04 Jan 2012, at 8:35 PM, Stack wrote:

> I looked at your logs.  They show you running w/ default ulimit.
> Please fix (See the requirements section, in particular the ulimit and
> maxprocs section, http://hbase.apache.org/book.html#ulimit, but ensure
> you satisfy all requirements).  Let us know how it goes after you
> change this.
> St.Ack
> 
> 
> On Wed, Jan 4, 2012 at 4:45 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>> 
>> On 03 Jan 2012, at 6:39 PM, Stack wrote:
>> 
>>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load the hbase services are under, it just crashes after 24-48 hours.
>>> 
>>> What kind of a 'crash' is it?  Is it OOME, or JVM seg faulting or just
>>> a full GC making the RS look like its gone away?
>> 
>> The crash seems slightly different each time (which I suppose is consistent with running out of RAM).  When our monitoring system alerts me to the problem and I log into the 3 servers, sometimes the regionservers on dynobuntu10 and dynobuntu12 have already shutdown and the last thing in their logs says that the Shutdown Hook finished.  The regionsever on dynobuntu17 (which also has the master running) is usually frozen with the last item in the log being 10-20 minutes prior.
>> 
>> I then run bin/stop-hbase.sh on the dynobuntu17: if the regionservers on dynobuntu10 or dynobuntu12 are still running, sometimes they shutdown gracefully whilst other times the logs just show Shutdown Hook Initiated and then nothing more.  The master then keeps logging which servers it is waiting on to shutdown.  I leave it like that for about 5-10 minutes allowing any processes that are still alive to do as much as they can before I do a kill -9.
>> 
>> That said, for the latest crash: when I logged in, the regionservers on dynobuntu10 and dynobuntu12 had shutdown already, and when I ran bin/stop-hbase.sh on the master, everything shutdown gracefully (kill -9 was not necessary) - this is the first time this has happened so effortlessly.
>> 
>>> 
>>>> The only difference the load makes is how quickly the services crash.  Even over this holiday season with our lowest load of the year, it crashes just after 36 hours of being started.  To fix it, I have to run the stop-hbase.sh command, wait a while and kill -9 any hbase processes that have stopped outputting logs or stopped responding, and then run start-hbase.sh again.
>>> 
>>> The process is deadlocked?   IIRC, 0.90.4 had a possible deadlock.
>>> You could try 0.90.5.
>> 
>> Sometimes yes, my answer above gives more detail.
>> Nice - didn't notice 0.90.5 had been released, I will try that next!
>> 
>>> 
>>> I took a look at some of the logs.  They do not run from server start
>>> because i do not see the ulimit output in there.  I'd like to see
>>> that.
>> 
>> Sorry, I see that now :(.  I have put the logs for the last two crashes here (it's 2.5 Megs):  https://rapidshare.com/files/4120740991/hbase-last-two-crashes-2012-01-03_2012-01-04.tgz
>> One crash was around 19:30 yesterday and the second was at 12:50 today.
>> 
>>> Looking at dynobuntu10, I see some interesting 'keys':
>>> 
>>> 2011-12-28 15:25:53,297 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
>>> to open region:
>>> UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape
>>> town&industryid=14&person=&problem=out of stock&other=&headline=why
>>> advertise goods online and you cannot deliver%29&incident=i purchased
>>> goods online that were supposedly in stock on the 5th october. 2010.
>>> after numerous phone calls i was promised that i would receive the
>>> ordered goods on the 20th october 2010. this has not happened to date.
>>> i spoke with them today and they promised to answer my queries on
>>> 21st october2010. how can you run a online busines ans sell %22we dont
>>> have stock%22%3a this is the easy way out as we have no proof of
>>> that%0d%0ait is just common curtousy to return a phone call. they have
>>> had my money in their bank account for 15 days. this seems like a
>>> ****. they could be reaping interest on thousands of peoples money.
>>> easy way of making money.%0d%0akalahari. net are in a comfort zone.
>>> they need to realize that customers are king%0d%0athey reimburse my
>>> money. i paid bank charges and transfer fees. what about this. my
>>> unnessessary phone calls. do they reinburse this.%0d%0acome on stop
>>> taking the innocent public for a ride with your sweet
>>> talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210
>>> %2f3573310 &cellno=%2b27
>>> 766881896&preview=preview,1308921597915.1827414390
>>> 
>>> Thats a single key.  It looks like you have an issue in your crawler's
>>> url extraction facility.
>> 
>> Yeah, that URL actually exists, but I can see how that can be a problem to use as a key.  Not sure what to do here, perhaps we should exclude URL's like this - or perhaps your hashing idea below will solve that.  I don't really know enough about hashing to make the call though - is it not possible to run into duplicate keys using e.g. an MD5 Hash? - The MD5 Hash of the above URL is: 8f157d290ceeacedb6c1be133f1ca153 - it seems logical to me that a string that small cannot possibly be unique given that the URL was originally 1431 characters long.  What is your opinion on this? - I will be doing some more research on this though.  Perhaps there is a Hash-type that is more unique that you could suggest for our purposes (but keeping in mind our ad delivery servers will need to hash the URL before querying hBase so it needs to be fast and not resource intensive)?
>> 
>>> If you have lots of URLs like the above, my guess is that you have
>>> massive indices.  Look at a regionserver and see how much RAM the
>>> indexes take up?
>> 
>> Yeah, looks pretty high, it is currently half the MaxHeap on a fresh start...
>> 
>> Below is what it is now, I have just disabled the block cache after the last crash as you suggested to try keep it stable until we have a real fix.  With the Block cache at the default of 25% (1 Gig) and the IndexSize at around 2 Gigs, that only leaves 1 Gig for everything else :( which is not much.
>> 
>> dynobuntu10: requests=93, regions=225, stores=225, storefiles=354, storefileIndexSize=2239, memstoreSize=35, compactionQueueSize=0, flushQueueSize=0, usedHeap=2639, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>> 
>> dynobuntu12: requests=305, regions=225, stores=225, storefiles=435, storefileIndexSize=2004, memstoreSize=31, compactionQueueSize=0, flushQueueSize=0, usedHeap=2321, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>> 
>> dynobuntu17: requests=51, regions=226, stores=226, storefiles=410, storefileIndexSize=2046, memstoreSize=36, compactionQueueSize=0, flushQueueSize=0, usedHeap=2927, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>> 
>>> 
>>> In dynoubuntu12 I see an OOME.  Interestingly, the OOME is while
>>> trying to read in a file's index on:
>>> 
>>> 2011-12-28 15:26:50,310 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION
>>> => {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877',
>>> STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>>> ENDKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>>> ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES =>
>>> '  indexUrlUID  UrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGenerator UrlIndex_Family:urluid org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexHostLocations  UrlIndex_Family:host UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:host org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexLocationCodes  UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:locationcodes org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable   indexHost  UrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:host org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexChannelUIDs  UrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:channeluids org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable',
>>> FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered
>>> => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647',
>>> datenextspider => '', daylastaccesed => '', host => '', originalurl =>
>>> '', locationcodes => '', extension => '', IN_MEMORY => 'false',
>>> COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount =>
>>> '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536',
>>> datelastmodified => '', urluid => '', BLOCKCACHE => 'true',
>>> channeluids => ''}]}}
>>> 
>>> 
>>> What is that INDEXES thing in the above schema?  Is that some
>>> secondary indexing thing you have going on?
>> 
>> Yes, it is a secondary Index we created. Basically, we serve ads and ads are queued per URL.  When we discover a new URL, we add it to HBase and give it a GUID which is stored as a column.  Other servers build lists of ads for each URL and store it against the GUID of the URL.  So when a request comes in for ads, we use HBase to lookup the URL and get the GUID so that it can then know which ads to show.  This is all handled by the main URLIndex Table.  BUT, sometimes and far less often, we have a situation where we have the GUID but need to look up the URL - so we have another table where the GUID is the rowKey and the URL is a column.  We also do this for hosts and for Channels (this is what we call a place where ads show).
>> 
>>> 
>>> You might take a look at the files under UrlIndex/1246560666 in the
>>> UrlIndex column family..... Print out their meta data and see what
>>> size indices you have.   See
>>> http://hbase.apache.org/book.html#rowkey.design in the book.  It has
>>> some pointers and some talk on issues you may be running into.
>>> 
>> 
>> root@dynobuntu17:/opt/hadoop-0.20.2# bin/hadoop fs -ls /hbase/UrlIndex/1246560666/UrlIndex_Family
>> Found 1 items
>> -rw-r--r--   2 root supergroup  199115380 2010-09-10 01:52 /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481
>> root@dynobuntu17:/opt/hadoop-0.20.2#
>> 
>> After reading that, it seems clear we need to make some minor changes to our Table Design, unfortunately it means creating a new table and copying rows to the new table - not a fun process because we can't be down whilst doing it, so we'll have to write some good code to ease the process - doable, but not fun.  I am hoping that upgrading the version to 0.90.5 and disabling the block cache buys us about a month so we have the time to plan it properly.
>> 
>>> 
>>>> Attached are my logs from the latest "start-to-crash".  There are 3 servers and hbase is being used for storing URL's - 7 client servers connect to hbase and perform URL Lookups at about 40 requests per second (this is the low load over this holiday season).  If the URL does not exist, it gets added.  The Key on the HTable is the URL and there are a few fields stored against it - e.g. DateDiscovered, Host, Script, QueryString, etc.
>>>> 
>>> 
>>> Do you have to scan the URLs in order or by website?  If not, you
>>> might have a key that is a hash of the URL (and keep actual URL as
>>> column data).
>> 
>> Yes sometimes we need to do scans like that - but only for a manual investigation, not during normal operation.  We may be able to get by as long as we can come up with a plan for how we can find the url's for a particular website.  I am concerned about the uniqueness of a hash.  I see there are lots of different hashes.  Will there be uniqueness issues? - we can't have two URL's having the same hash.
>> 
>>> 
>>>> Each server has a hadoop datanode and an hbase regionserver and 1 of the servers additionally has the namenode, master and zookeeper.  On first start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the clients, the usedHeap slowly climes until it reaches the maxHeap and shortly after that, the regionservers start crashing - sometimes they actually shutdown gracefully by themselves.
>>>> 
>>> 
>>> 
>>> Are the URL lookups totally random?  If so, turn off the block cache.
>>> That'll get you some more memory.
>> 
>> Yes, pretty random and as we grow, it will get more random.  I have disabled the block cache - and looking at the heap stats which I pasted above, it seems like it will buy us some time to make some long-term changes - I will keep you updated here.
>> 
>>> 
>>> Add more servers too to spread the load if you can afford it.  Things
>>> tend to run smoother once you get above 5 servers or so.
>> 
>> We currently have 4 instances of HBase - 2 each have 5 servers and are used for Ad Delivery Log Storage and the other two are used for URL lookups and each have 3 servers.  I will struggle to get our finance guys to approve more servers for hBase, but if that is my only option I will definitely try :)
>> 
>> Coincidently, the 2 instances used for Ad Delivery log storage are down at the moment, but it is because we are having stability issues with Ubuntu Server - they periodically do a memory dump and shut down - even if nothing is running on them.  I have to tackle that problem too pretty soon.  In the meantime, MySQL is taking up the slack, but we will quickly run into performance issues if we don't fix that soon.  But anyway, at the moment I don't need your help with those servers because it does not seem to be hBase or hadoop causing the crashes.  I am tackling this URL Server problem first.
>> 
>>> 
>>>> Originally, we had hbase.regionserver.handler.count set to 100 and I have now removed that to leave it as default which has not helped.
>>>> 
>>>> We have not made any changes to the clients and we have a mirrored instance of this in our UK Data Centre which is still running 0.20.6 and servicing 10 clients currently at over 300 requests per second (again low load over the holidays) and it is 100% stable.
>>>> 
>>>> What do I do now? - your website says I cannot downgrade?
>>>> 
>>> 
>>> That is right.
>>> 
>>> Lets get this stable again.
>>> 
>>> St.Ack
>> 
>> Thanks for your help so far.  I have already disabled the block cache (which I am sure will show an immediate improvement) and I will schedule an upgrade of hBase to 0.90.5 during this week and then monitor it.  If you do have some knowledge about the uniqueness of an MD5 Hash, please share it with me if you have the time? - it will help me whilst I plan the changes we need to make to the table structure.
>> 
>> Regards,
>> Seraph
>>

Re: RAM Problems - Keeps Crashing

Posted by Stack <st...@duboce.net>.

I looked at your logs.  They show you running w/ default ulimit.
Please fix (See the requirements section, in particular the ulimit and
maxprocs section, http://hbase.apache.org/book.html#ulimit, but ensure
you satisfy all requirements).  Let us know how it goes after you
change this.
St.Ack


On Wed, Jan 4, 2012 at 4:45 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>
> On 03 Jan 2012, at 6:39 PM, Stack wrote:
>
>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load the hbase services are under, it just crashes after 24-48 hours.
>>
>> What kind of a 'crash' is it?  Is it OOME, or JVM seg faulting or just
>> a full GC making the RS look like its gone away?
>
> The crash seems slightly different each time (which I suppose is consistent with running out of RAM).  When our monitoring system alerts me to the problem and I log into the 3 servers, sometimes the regionservers on dynobuntu10 and dynobuntu12 have already shutdown and the last thing in their logs says that the Shutdown Hook finished.  The regionsever on dynobuntu17 (which also has the master running) is usually frozen with the last item in the log being 10-20 minutes prior.
>
> I then run bin/stop-hbase.sh on the dynobuntu17: if the regionservers on dynobuntu10 or dynobuntu12 are still running, sometimes they shutdown gracefully whilst other times the logs just show Shutdown Hook Initiated and then nothing more.  The master then keeps logging which servers it is waiting on to shutdown.  I leave it like that for about 5-10 minutes allowing any processes that are still alive to do as much as they can before I do a kill -9.
>
> That said, for the latest crash: when I logged in, the regionservers on dynobuntu10 and dynobuntu12 had shutdown already, and when I ran bin/stop-hbase.sh on the master, everything shutdown gracefully (kill -9 was not necessary) - this is the first time this has happened so effortlessly.
>
>>
>>>  The only difference the load makes is how quickly the services crash.  Even over this holiday season with our lowest load of the year, it crashes just after 36 hours of being started.  To fix it, I have to run the stop-hbase.sh command, wait a while and kill -9 any hbase processes that have stopped outputting logs or stopped responding, and then run start-hbase.sh again.
>>
>> The process is deadlocked?   IIRC, 0.90.4 had a possible deadlock.
>> You could try 0.90.5.
>
> Sometimes yes, my answer above gives more detail.
> Nice - didn't notice 0.90.5 had been released, I will try that next!
>
>>
>> I took a look at some of the logs.  They do not run from server start
>> because i do not see the ulimit output in there.  I'd like to see
>> that.
>
> Sorry, I see that now :(.  I have put the logs for the last two crashes here (it's 2.5 Megs):  https://rapidshare.com/files/4120740991/hbase-last-two-crashes-2012-01-03_2012-01-04.tgz
> One crash was around 19:30 yesterday and the second was at 12:50 today.
>
>> Looking at dynobuntu10, I see some interesting 'keys':
>>
>> 2011-12-28 15:25:53,297 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
>> to open region:
>> UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape
>> town&industryid=14&person=&problem=out of stock&other=&headline=why
>> advertise goods online and you cannot deliver%29&incident=i purchased
>> goods online that were supposedly in stock on the 5th october. 2010.
>> after numerous phone calls i was promised that i would receive the
>> ordered goods on the 20th october 2010. this has not happened to date.
>> i spoke with them today and they promised to answer my queries on
>> 21st october2010. how can you run a online busines ans sell %22we dont
>> have stock%22%3a this is the easy way out as we have no proof of
>> that%0d%0ait is just common curtousy to return a phone call. they have
>> had my money in their bank account for 15 days. this seems like a
>> ****. they could be reaping interest on thousands of peoples money.
>> easy way of making money.%0d%0akalahari. net are in a comfort zone.
>> they need to realize that customers are king%0d%0athey reimburse my
>> money. i paid bank charges and transfer fees. what about this. my
>> unnessessary phone calls. do they reinburse this.%0d%0acome on stop
>> taking the innocent public for a ride with your sweet
>> talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210
>> %2f3573310 &cellno=%2b27
>> 766881896&preview=preview,1308921597915.1827414390
>>
>> Thats a single key.  It looks like you have an issue in your crawler's
>> url extraction facility.
>
> Yeah, that URL actually exists, but I can see how that can be a problem to use as a key.  Not sure what to do here, perhaps we should exclude URL's like this - or perhaps your hashing idea below will solve that.  I don't really know enough about hashing to make the call though - is it not possible to run into duplicate keys using e.g. an MD5 Hash? - The MD5 Hash of the above URL is: 8f157d290ceeacedb6c1be133f1ca153 - it seems logical to me that a string that small cannot possibly be unique given that the URL was originally 1431 characters long.  What is your opinion on this? - I will be doing some more research on this though.  Perhaps there is a Hash-type that is more unique that you could suggest for our purposes (but keeping in mind our ad delivery servers will need to hash the URL before querying hBase so it needs to be fast and not resource intensive)?
>
>> If you have lots of URLs like the above, my guess is that you have
>> massive indices.  Look at a regionserver and see how much RAM the
>> indexes take up?
>
> Yeah, looks pretty high, it is currently half the MaxHeap on a fresh start...
>
> Below is what it is now, I have just disabled the block cache after the last crash as you suggested to try keep it stable until we have a real fix.  With the Block cache at the default of 25% (1 Gig) and the IndexSize at around 2 Gigs, that only leaves 1 Gig for everything else :( which is not much.
>
> dynobuntu10: requests=93, regions=225, stores=225, storefiles=354, storefileIndexSize=2239, memstoreSize=35, compactionQueueSize=0, flushQueueSize=0, usedHeap=2639, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>
> dynobuntu12: requests=305, regions=225, stores=225, storefiles=435, storefileIndexSize=2004, memstoreSize=31, compactionQueueSize=0, flushQueueSize=0, usedHeap=2321, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>
> dynobuntu17: requests=51, regions=226, stores=226, storefiles=410, storefileIndexSize=2046, memstoreSize=36, compactionQueueSize=0, flushQueueSize=0, usedHeap=2927, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>
>>
>> In dynoubuntu12 I see an OOME.  Interestingly, the OOME is while
>> trying to read in a file's index on:
>>
>> 2011-12-28 15:26:50,310 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION
>> => {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877',
>> STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>> ENDKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>> ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES =>
>> '  indexUrlUID  UrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGenerator UrlIndex_Family:urluid org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexHostLocations  UrlIndex_Family:host UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:host org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexLocationCodes  UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:locationcodes org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable   indexHost  UrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:host org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable indexChannelUIDs  UrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator UrlIndex_Family:channeluids org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable',
>> FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered
>> => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647',
>> datenextspider => '', daylastaccesed => '', host => '', originalurl =>
>> '', locationcodes => '', extension => '', IN_MEMORY => 'false',
>> COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount =>
>> '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536',
>> datelastmodified => '', urluid => '', BLOCKCACHE => 'true',
>> channeluids => ''}]}}
>>
>>
>> What is that INDEXES thing in the above schema?  Is that some
>> secondary indexing thing you have going on?
>
> Yes, it is a secondary Index we created. Basically, we serve ads and ads are queued per URL.  When we discover a new URL, we add it to HBase and give it a GUID which is stored as a column.  Other servers build lists of ads for each URL and store it against the GUID of the URL.  So when a request comes in for ads, we use HBase to lookup the URL and get the GUID so that it can then know which ads to show.  This is all handled by the main URLIndex Table.  BUT, sometimes and far less often, we have a situation where we have the GUID but need to look up the URL - so we have another table where the GUID is the rowKey and the URL is a column.  We also do this for hosts and for Channels (this is what we call a place where ads show).
>
>>
>> You might take a look at the files under UrlIndex/1246560666 in the
>> UrlIndex column family..... Print out their meta data and see what
>> size indices you have.   See
>> http://hbase.apache.org/book.html#rowkey.design in the book.  It has
>> some pointers and some talk on issues you may be running into.
>>
>
> root@dynobuntu17:/opt/hadoop-0.20.2# bin/hadoop fs -ls /hbase/UrlIndex/1246560666/UrlIndex_Family
> Found 1 items
> -rw-r--r--   2 root supergroup  199115380 2010-09-10 01:52 /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481
> root@dynobuntu17:/opt/hadoop-0.20.2#
>
> After reading that, it seems clear we need to make some minor changes to our Table Design, unfortunately it means creating a new table and copying rows to the new table - not a fun process because we can't be down whilst doing it, so we'll have to write some good code to ease the process - doable, but not fun.  I am hoping that upgrading the version to 0.90.5 and disabling the block cache buys us about a month so we have the time to plan it properly.
>
>>
>>> Attached are my logs from the latest "start-to-crash".  There are 3 servers and hbase is being used for storing URL's - 7 client servers connect to hbase and perform URL Lookups at about 40 requests per second (this is the low load over this holiday season).  If the URL does not exist, it gets added.  The Key on the HTable is the URL and there are a few fields stored against it - e.g. DateDiscovered, Host, Script, QueryString, etc.
>>>
>>
>> Do you have to scan the URLs in order or by website?  If not, you
>> might have a key that is a hash of the URL (and keep actual URL as
>> column data).
>
> Yes sometimes we need to do scans like that - but only for a manual investigation, not during normal operation.  We may be able to get by as long as we can come up with a plan for how we can find the url's for a particular website.  I am concerned about the uniqueness of a hash.  I see there are lots of different hashes.  Will there be uniqueness issues? - we can't have two URL's having the same hash.
>
>>
>>> Each server has a hadoop datanode and an hbase regionserver and 1 of the servers additionally has the namenode, master and zookeeper.  On first start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the clients, the usedHeap slowly climes until it reaches the maxHeap and shortly after that, the regionservers start crashing - sometimes they actually shutdown gracefully by themselves.
>>>
>>
>>
>> Are the URL lookups totally random?  If so, turn off the block cache.
>> That'll get you some more memory.
>
> Yes, pretty random and as we grow, it will get more random.  I have disabled the block cache - and looking at the heap stats which I pasted above, it seems like it will buy us some time to make some long-term changes - I will keep you updated here.
>
>>
>> Add more servers too to spread the load if you can afford it.  Things
>> tend to run smoother once you get above 5 servers or so.
>
> We currently have 4 instances of HBase - 2 each have 5 servers and are used for Ad Delivery Log Storage and the other two are used for URL lookups and each have 3 servers.  I will struggle to get our finance guys to approve more servers for hBase, but if that is my only option I will definitely try :)
>
> Coincidently, the 2 instances used for Ad Delivery log storage are down at the moment, but it is because we are having stability issues with Ubuntu Server - they periodically do a memory dump and shut down - even if nothing is running on them.  I have to tackle that problem too pretty soon.  In the meantime, MySQL is taking up the slack, but we will quickly run into performance issues if we don't fix that soon.  But anyway, at the moment I don't need your help with those servers because it does not seem to be hBase or hadoop causing the crashes.  I am tackling this URL Server problem first.
>
>>
>>> Originally, we had hbase.regionserver.handler.count set to 100 and I have now removed that to leave it as default which has not helped.
>>>
>>> We have not made any changes to the clients and we have a mirrored instance of this in our UK Data Centre which is still running 0.20.6 and servicing 10 clients currently at over 300 requests per second (again low load over the holidays) and it is 100% stable.
>>>
>>> What do I do now? - your website says I cannot downgrade?
>>>
>>
>> That is right.
>>
>> Lets get this stable again.
>>
>> St.Ack
>
> Thanks for your help so far.  I have already disabled the block cache (which I am sure will show an immediate improvement) and I will schedule an upgrade of hBase to 0.90.5 during this week and then monitor it.  If you do have some knowledge about the uniqueness of an MD5 Hash, please share it with me if you have the time? - it will help me whilst I plan the changes we need to make to the table structure.
>
> Regards,
> Seraph
>

Re: RAM Problems - Keeps Crashing

Posted by Seraph Imalia <se...@eisp.co.za>.

On 03 Jan 2012, at 6:39 PM, Stack wrote:

> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:
>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load the hbase services are under, it just crashes after 24-48 hours.
> 
> What kind of a 'crash' is it?  Is it OOME, or JVM seg faulting or just
> a full GC making the RS look like its gone away?

The crash seems slightly different each time (which I suppose is consistent with running out of RAM).  When our monitoring system alerts me to the problem and I log into the 3 servers, sometimes the regionservers on dynobuntu10 and dynobuntu12 have already shutdown and the last thing in their logs says that the Shutdown Hook finished.  The regionsever on dynobuntu17 (which also has the master running) is usually frozen with the last item in the log being 10-20 minutes prior.  

I then run bin/stop-hbase.sh on the dynobuntu17: if the regionservers on dynobuntu10 or dynobuntu12 are still running, sometimes they shutdown gracefully whilst other times the logs just show Shutdown Hook Initiated and then nothing more.  The master then keeps logging which servers it is waiting on to shutdown.  I leave it like that for about 5-10 minutes allowing any processes that are still alive to do as much as they can before I do a kill -9.

That said, for the latest crash: when I logged in, the regionservers on dynobuntu10 and dynobuntu12 had shutdown already, and when I ran bin/stop-hbase.sh on the master, everything shutdown gracefully (kill -9 was not necessary) - this is the first time this has happened so effortlessly.

> 
>>  The only difference the load makes is how quickly the services crash.  Even over this holiday season with our lowest load of the year, it crashes just after 36 hours of being started.  To fix it, I have to run the stop-hbase.sh command, wait a while and kill -9 any hbase processes that have stopped outputting logs or stopped responding, and then run start-hbase.sh again.
> 
> The process is deadlocked?   IIRC, 0.90.4 had a possible deadlock.
> You could try 0.90.5.

Sometimes yes, my answer above gives more detail.  
Nice - didn't notice 0.90.5 had been released, I will try that next!

> 
> I took a look at some of the logs.  They do not run from server start
> because i do not see the ulimit output in there.  I'd like to see
> that.

Sorry, I see that now :(.  I have put the logs for the last two crashes here (it's 2.5 Megs):  https://rapidshare.com/files/4120740991/hbase-last-two-crashes-2012-01-03_2012-01-04.tgz
One crash was around 19:30 yesterday and the second was at 12:50 today.

> Looking at dynobuntu10, I see some interesting 'keys':
> 
> 2011-12-28 15:25:53,297 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
> to open region:
> UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape
> town&industryid=14&person=&problem=out of stock&other=&headline=why 
> advertise goods online and you cannot deliver%29&incident=i purchased
> goods online that were supposedly in stock on the 5th october. 2010.
> after numerous phone calls i was promised that i would receive the
> ordered goods on the 20th october 2010. this has not happened to date.
> i spoke with them today and they promised to answer my queries on
> 21st october2010. how can you run a online busines ans sell %22we dont
> have stock%22%3a this is the easy way out as we have no proof of
> that%0d%0ait is just common curtousy to return a phone call. they have
> had my money in their bank account for 15 days. this seems like a
> ****. they could be reaping interest on thousands of peoples money.
> easy way of making money.%0d%0akalahari. net are in a comfort zone.
> they need to realize that customers are king%0d%0athey reimburse my
> money. i paid bank charges and transfer fees. what about this. my
> unnessessary phone calls. do they reinburse this.%0d%0acome on stop
> taking the innocent public for a ride with your sweet
> talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210
> %2f3573310 &cellno=%2b27
> 766881896&preview=preview,1308921597915.1827414390
> 
> Thats a single key.  It looks like you have an issue in your crawler's
> url extraction facility.

Yeah, that URL actually exists, but I can see how that can be a problem to use as a key.  Not sure what to do here, perhaps we should exclude URL's like this - or perhaps your hashing idea below will solve that.  I don't really know enough about hashing to make the call though - is it not possible to run into duplicate keys using e.g. an MD5 Hash? - The MD5 Hash of the above URL is: 8f157d290ceeacedb6c1be133f1ca153 - it seems logical to me that a string that small cannot possibly be unique given that the URL was originally 1431 characters long.  What is your opinion on this? - I will be doing some more research on this though.  Perhaps there is a Hash-type that is more unique that you could suggest for our purposes (but keeping in mind our ad delivery servers will need to hash the URL before querying hBase so it needs to be fast and not resource intensive)?

> If you have lots of URLs like the above, my guess is that you have
> massive indices.  Look at a regionserver and see how much RAM the
> indexes take up?

Yeah, looks pretty high, it is currently half the MaxHeap on a fresh start... 

Below is what it is now, I have just disabled the block cache after the last crash as you suggested to try keep it stable until we have a real fix.  With the Block cache at the default of 25% (1 Gig) and the IndexSize at around 2 Gigs, that only leaves 1 Gig for everything else :( which is not much.

dynobuntu10: requests=93, regions=225, stores=225, storefiles=354, storefileIndexSize=2239, memstoreSize=35, compactionQueueSize=0, flushQueueSize=0, usedHeap=2639, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0

dynobuntu12: requests=305, regions=225, stores=225, storefiles=435, storefileIndexSize=2004, memstoreSize=31, compactionQueueSize=0, flushQueueSize=0, usedHeap=2321, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0

dynobuntu17: requests=51, regions=226, stores=226, storefiles=410, storefileIndexSize=2046, memstoreSize=36, compactionQueueSize=0, flushQueueSize=0, usedHeap=2927, maxHeap=4087, blockCacheSize=0, blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0

> 
> In dynoubuntu12 I see an OOME.  Interestingly, the OOME is while
> trying to read in a file's index on:
> 
> 2011-12-28 15:26:50,310 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION
> => {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877',
> STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
> ENDKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
> ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES =>
> 'indexUrlUIDUrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGeneratorUrlIndex_Family:urluidorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexHostLocationsUrlIndex_Family:hostUrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:hostorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexLocationCodesUrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:locationcodesorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable	indexHostUrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:hostorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexChannelUIDsUrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:channeluidsorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable',
> FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered
> => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647',
> datenextspider => '', daylastaccesed => '', host => '', originalurl =>
> '', locationcodes => '', extension => '', IN_MEMORY => 'false',
> COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount =>
> '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536',
> datelastmodified => '', urluid => '', BLOCKCACHE => 'true',
> channeluids => ''}]}}
> 
> 
> What is that INDEXES thing in the above schema?  Is that some
> secondary indexing thing you have going on?

Yes, it is a secondary Index we created. Basically, we serve ads and ads are queued per URL.  When we discover a new URL, we add it to HBase and give it a GUID which is stored as a column.  Other servers build lists of ads for each URL and store it against the GUID of the URL.  So when a request comes in for ads, we use HBase to lookup the URL and get the GUID so that it can then know which ads to show.  This is all handled by the main URLIndex Table.  BUT, sometimes and far less often, we have a situation where we have the GUID but need to look up the URL - so we have another table where the GUID is the rowKey and the URL is a column.  We also do this for hosts and for Channels (this is what we call a place where ads show).

> 
> You might take a look at the files under UrlIndex/1246560666 in the
> UrlIndex column family..... Print out their meta data and see what
> size indices you have.   See
> http://hbase.apache.org/book.html#rowkey.design in the book.  It has
> some pointers and some talk on issues you may be running into.
> 

root@dynobuntu17:/opt/hadoop-0.20.2# bin/hadoop fs -ls /hbase/UrlIndex/1246560666/UrlIndex_Family
Found 1 items
-rw-r--r--   2 root supergroup  199115380 2010-09-10 01:52 /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481
root@dynobuntu17:/opt/hadoop-0.20.2# 

After reading that, it seems clear we need to make some minor changes to our Table Design, unfortunately it means creating a new table and copying rows to the new table - not a fun process because we can't be down whilst doing it, so we'll have to write some good code to ease the process - doable, but not fun.  I am hoping that upgrading the version to 0.90.5 and disabling the block cache buys us about a month so we have the time to plan it properly.

> 
>> Attached are my logs from the latest "start-to-crash".  There are 3 servers and hbase is being used for storing URL's - 7 client servers connect to hbase and perform URL Lookups at about 40 requests per second (this is the low load over this holiday season).  If the URL does not exist, it gets added.  The Key on the HTable is the URL and there are a few fields stored against it - e.g. DateDiscovered, Host, Script, QueryString, etc.
>> 
> 
> Do you have to scan the URLs in order or by website?  If not, you
> might have a key that is a hash of the URL (and keep actual URL as
> column data).

Yes sometimes we need to do scans like that - but only for a manual investigation, not during normal operation.  We may be able to get by as long as we can come up with a plan for how we can find the url's for a particular website.  I am concerned about the uniqueness of a hash.  I see there are lots of different hashes.  Will there be uniqueness issues? - we can't have two URL's having the same hash.

> 
>> Each server has a hadoop datanode and an hbase regionserver and 1 of the servers additionally has the namenode, master and zookeeper.  On first start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the clients, the usedHeap slowly climes until it reaches the maxHeap and shortly after that, the regionservers start crashing - sometimes they actually shutdown gracefully by themselves.
>> 
> 
> 
> Are the URL lookups totally random?  If so, turn off the block cache.
> That'll get you some more memory.

Yes, pretty random and as we grow, it will get more random.  I have disabled the block cache - and looking at the heap stats which I pasted above, it seems like it will buy us some time to make some long-term changes - I will keep you updated here.

> 
> Add more servers too to spread the load if you can afford it.  Things
> tend to run smoother once you get above 5 servers or so.

We currently have 4 instances of HBase - 2 each have 5 servers and are used for Ad Delivery Log Storage and the other two are used for URL lookups and each have 3 servers.  I will struggle to get our finance guys to approve more servers for hBase, but if that is my only option I will definitely try :)

Coincidently, the 2 instances used for Ad Delivery log storage are down at the moment, but it is because we are having stability issues with Ubuntu Server - they periodically do a memory dump and shut down - even if nothing is running on them.  I have to tackle that problem too pretty soon.  In the meantime, MySQL is taking up the slack, but we will quickly run into performance issues if we don't fix that soon.  But anyway, at the moment I don't need your help with those servers because it does not seem to be hBase or hadoop causing the crashes.  I am tackling this URL Server problem first.

> 
>> Originally, we had hbase.regionserver.handler.count set to 100 and I have now removed that to leave it as default which has not helped.
>> 
>> We have not made any changes to the clients and we have a mirrored instance of this in our UK Data Centre which is still running 0.20.6 and servicing 10 clients currently at over 300 requests per second (again low load over the holidays) and it is 100% stable.
>> 
>> What do I do now? - your website says I cannot downgrade?
>> 
> 
> That is right.
> 
> Lets get this stable again.
> 
> St.Ack

Thanks for your help so far.  I have already disabled the block cache (which I am sure will show an immediate improvement) and I will schedule an upgrade of hBase to 0.90.5 during this week and then monitor it.  If you do have some knowledge about the uniqueness of an MD5 Hash, please share it with me if you have the time? - it will help me whilst I plan the changes we need to make to the table structure.

Regards,
Seraph

Re: RAM Problems - Keeps Crashing

Posted by Stack <st...@duboce.net>.

On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <se...@eisp.co.za> wrote:
> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load the hbase services are under, it just crashes after 24-48 hours.

What kind of a 'crash' is it?  Is it OOME, or JVM seg faulting or just
a full GC making the RS look like its gone away?

>  The only difference the load makes is how quickly the services crash.  Even over this holiday season with our lowest load of the year, it crashes just after 36 hours of being started.  To fix it, I have to run the stop-hbase.sh command, wait a while and kill -9 any hbase processes that have stopped outputting logs or stopped responding, and then run start-hbase.sh again.
>

The process is deadlocked?   IIRC, 0.90.4 had a possible deadlock.
You could try 0.90.5.

I took a look at some of the logs.  They do not run from server start
because i do not see the ulimit output in there.  I'd like to see
that.

Looking at dynobuntu10, I see some interesting 'keys':

2011-12-28 15:25:53,297 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
to open region:
UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape
town&industryid=14&person=&problem=out of stock&other=&headline=why
advertise goods online and you cannot deliver%29&incident=i purchased
goods online that were supposedly in stock on the 5th october. 2010.
after numerous phone calls i was promised that i would receive the
ordered goods on the 20th october 2010. this has not happened to date.
i spoke with them today and they promised to answer my queries on
21st october2010. how can you run a online busines ans sell %22we dont
have stock%22%3a this is the easy way out as we have no proof of
that%0d%0ait is just common curtousy to return a phone call. they have
had my money in their bank account for 15 days. this seems like a
****. they could be reaping interest on thousands of peoples money.
easy way of making money.%0d%0akalahari. net are in a comfort zone.
they need to realize that customers are king%0d%0athey reimburse my
money. i paid bank charges and transfer fees. what about this. my
unnessessary phone calls. do they reinburse this.%0d%0acome on stop
taking the innocent public for a ride with your sweet
talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210
%2f3573310 &cellno=%2b27
766881896&preview=preview,1308921597915.1827414390

Thats a single key.  It looks like you have an issue in your crawler's
url extraction facility.

If you have lots of URLs like the above, my guess is that you have
massive indices.  Look at a regionserver and see how much RAM the
indexes take up?

In dynoubuntu12 I see an OOME.  Interestingly, the OOME is while
trying to read in a file's index on:

2011-12-28 15:26:50,310 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION
=> {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877',
STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
ENDKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES =>
'indexUrlUIDUrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGeneratorUrlIndex_Family:urluidorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexHostLocationsUrlIndex_Family:hostUrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:hostorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexLocationCodesUrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:locationcodesorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable	indexHostUrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:hostorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparableindexChannelUIDsUrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:channeluidsorg.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable',
FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered
=> '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647',
datenextspider => '', daylastaccesed => '', host => '', originalurl =>
'', locationcodes => '', extension => '', IN_MEMORY => 'false',
COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount =>
'', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536',
datelastmodified => '', urluid => '', BLOCKCACHE => 'true',
channeluids => ''}]}}

What is that INDEXES thing in the above schema?  Is that some
secondary indexing thing you have going on?

You might take a look at the files under UrlIndex/1246560666 in the
UrlIndex column family..... Print out their meta data and see what
size indices you have.   See
http://hbase.apache.org/book.html#rowkey.design in the book.  It has
some pointers and some talk on issues you may be running into.

> Attached are my logs from the latest "start-to-crash".  There are 3 servers and hbase is being used for storing URL's - 7 client servers connect to hbase and perform URL Lookups at about 40 requests per second (this is the low load over this holiday season).  If the URL does not exist, it gets added.  The Key on the HTable is the URL and there are a few fields stored against it - e.g. DateDiscovered, Host, Script, QueryString, etc.
>

Do you have to scan the URLs in order or by website?  If not, you
might have a key that is a hash of the URL (and keep actual URL as
column data).

> Each server has a hadoop datanode and an hbase regionserver and 1 of the servers additionally has the namenode, master and zookeeper.  On first start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the clients, the usedHeap slowly climes until it reaches the maxHeap and shortly after that, the regionservers start crashing - sometimes they actually shutdown gracefully by themselves.
>

Are the URL lookups totally random?  If so, turn off the block cache.
That'll get you some more memory.

Add more servers too to spread the load if you can afford it.  Things
tend to run smoother once you get above 5 servers or so.

> Originally, we had hbase.regionserver.handler.count set to 100 and I have now removed that to leave it as default which has not helped.
>
> We have not made any changes to the clients and we have a mirrored instance of this in our UK Data Centre which is still running 0.20.6 and servicing 10 clients currently at over 300 requests per second (again low load over the holidays) and it is 100% stable.
>
> What do I do now? - your website says I cannot downgrade?
>

That is right.

Lets get this stable again.

St.Ack