Posted to user@hbase.apache.org by Homer Strong <ho...@gmail.com> on 2011/12/16 22:57:38 UTC

RS memory leak?

Hi HBasistas,

We're experiencing the following behavior when starting our cluster
with version 0.90.4-cdh3u2:

Whenever a RS is assigned a large (> 500-600) number of regions, the
heap usage grows without bound. Then the RS constantly GCs and must be
killed.

This is with 2000 regions over 3 RSs, each with a 10 GB heap. The RSs
run on EC2 xlarge instances; the master is on its own large instance.
Datanodes are colocated with the RSs, and the namenode with the master.

Looks like a memory leak? Any suggestions would be appreciated.

Thanks,
Homer

Re: RS memory leak?

Posted by Homer Strong <ho...@gmail.com>.
After our weekend struggle, we ended up just dropping some tables that
we can rebuild with MR. Planning to merge smaller regions in the
immediate future. With fewer regions, the cluster started with no
issues.

Thanks for your suggestions!
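
For anyone who hits this later, the offline merge tool is what we plan
to use (a sketch only; the table and region names below are
placeholders, and in 0.90 the tool must be run while HBase is down):

    # Merge two adjacent regions of <tablename> offline. The region
    # arguments are the fully qualified region names as listed in
    # .META. or the master web UI. Stop HBase before running this.
    $ ./bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> \
        <fully-qualified-region-1> <fully-qualified-region-2>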


On Sat, Dec 17, 2011 at 3:16 PM, Stack <st...@duboce.net> wrote:
> On Sat, Dec 17, 2011 at 1:29 PM, Homer Strong <ho...@gmail.com> wrote:
>> @Stack, we tried your suggestion for getting off the ground with an
>> extra RS. We added 1 more identical RS, and after balancing, killed
>> the extra one. The cluster remained stable for the night, but this
>> morning all 3 of our RSs had OOMs.
>>
>
> Sounds like you need more than 3 regionservers for your current load.
> Run with 4 or 5 for a while and use the time to work on merging your
> regions down to a smaller number -- run with many fewer per
> regionserver (make your regions bigger) -- and figure out why you are
> getting the OOME.
>
> What do you see occupying memory in the regionserver? You have 700 or
> so regions per server? You have a block cache of what size? How much
> heap are the storefile indexes taking up (do you have wide keys)? Are
> the cells large?
>
> You disabled swap, but is your memory overcommitted? I.e., if you add
> up the memory used by all the processes on the box, is it greater
> than physical memory?
>
>
>
>> In the logs we find many entries like
>>
>> https://gist.github.com/eadb953fcadbeb302143
>>
>> Followed by the RSs aborting due to OOMs. Could this be related to
>> HBASE-4222?
>>
>
> What's happening on the datanodes? E.g. 10.192.21.220:50010? Look in
> its logs. Why is the regionserver failing to sync? See if you can
> figure it out.
>
> St.Ack
>
>> Thanks for your help!
>>
>>
>> On Fri, Dec 16, 2011 at 3:31 PM, Homer Strong <ho...@gmail.com> wrote:
>>> Thanks for the response! To add to our problem description: it
>>> doesn't seem to be an absolute number of regions that triggers the
>>> memory overuse; we've seen it happen now with a wide range of
>>> region counts.
>>>
>>>> Just opening regions, it does this?
>>> Yes.
>>>
>>>> No load?
>>> Very low load, no requests.
>>>
>>>> No swapping?
>>> Swapping is disabled.
>>>
>>>
>>>> Bring up more xlarge instances and see if that gets you off the
>>>> ground? Then work on getting your number of regions down?
>>> We'll try this and get back in a couple minutes!
>>>
>>>
>>>
>>> On Fri, Dec 16, 2011 at 3:21 PM, Stack <st...@duboce.net> wrote:
>>>> On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <ho...@gmail.com> wrote:
>>>>> Whenever a RS is assigned a large (> 500-600) number of regions, the
>>>>> heap usage grows without bound. Then the RS constantly GCs and must be
>>>>> killed.
>>>>>
>>>>
>>>> Just opening regions, it does this?
>>>>
>>>> No load?
>>>>
>>>> No swapping?
>>>>
>>>> What JVM and what args for JVM?
>>>>
>>>>
>>>>> This is with 2000 regions over 3 RSs, each with a 10 GB heap. The
>>>>> RSs run on EC2 xlarge instances; the master is on its own large
>>>>> instance. Datanodes are colocated with the RSs, and the namenode
>>>>> with the master.
>>>>>
>>>>> Looks like a memory leak? Any suggestions would be appreciated.
>>>>>
>>>>
>>>> Bring up more xlarge instances and see if that gets you off the
>>>> ground? Then work on getting your number of regions down?
>>>>
>>>> St.Ack

Re: RS memory leak?

Posted by Stack <st...@duboce.net>.
On Sat, Dec 17, 2011 at 1:29 PM, Homer Strong <ho...@gmail.com> wrote:
> @Stack, we tried your suggestion for getting off the ground with an
> extra RS. We added 1 more identical RS, and after balancing, killed
> the extra one. The cluster remained stable for the night, but this
> morning all 3 of our RSs had OOMs.
>

Sounds like you need more than 3 regionservers for your current load.
Run with 4 or 5 for a while and use the time to work on merging your
regions down to a smaller number -- run with many fewer per
regionserver (make your regions bigger) -- and figure out why you are
getting the OOME.
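
For example, one way to make regions bigger is to raise the split
threshold in hbase-site.xml (a sketch; 0.90's default is 256MB, and the
1GB below is just an example value -- pick a size that fits your data):

    <!-- hbase-site.xml: a region splits once a store grows past this
         size, so a larger value means fewer, bigger regions. -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value> <!-- 1 GB -->
    </property>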

What do you see occupying memory in the regionserver? You have 700 or
so regions per server? You have a block cache of what size? How much
heap are the storefile indexes taking up (do you have wide keys)? Are
the cells large?
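
For a back-of-envelope heap budget, assuming the 0.90 defaults of
hfile.block.cache.size=0.2 and
hbase.regionserver.global.memstore.upperLimit=0.4:

    # 10 GB regionserver heap, default fractions:
    #   block cache:  10 GB * 0.2 = 2 GB
    #   memstores:    10 GB * 0.4 = 4 GB (upper limit)
    #   remainder:   ~4 GB for storefile indexes, RPC buffers, etc.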

You disabled swap, but is your memory overcommitted? I.e., if you add
up the memory used by all the processes on the box, is it greater than
physical memory?
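
A quick way to check (sums resident set sizes and compares the total
against physical RAM):

    # Total resident memory (RSS) of every process, in GB; if this
    # exceeds what free reports as total RAM, you are overcommitted.
    $ ps -eo rss= | awk '{sum+=$1} END {printf "total RSS: %.1f GB\n", sum/1024/1024}'
    $ free -g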



> In the logs we find many entries like
>
> https://gist.github.com/eadb953fcadbeb302143
>
> Followed by the RSs aborting due to OOMs. Could this be related to
> HBASE-4222?
>

What's happening on the datanodes? E.g. 10.192.21.220:50010? Look in
its logs. Why is the regionserver failing to sync? See if you can
figure it out.

St.Ack

> Thanks for your help!
>
>
> On Fri, Dec 16, 2011 at 3:31 PM, Homer Strong <ho...@gmail.com> wrote:
>> Thanks for the response! To add to our problem description: it
>> doesn't seem to be an absolute number of regions that triggers the
>> memory overuse; we've seen it happen now with a wide range of region
>> counts.
>>
>>> Just opening regions, it does this?
>> Yes.
>>
>>> No load?
>> Very low load, no requests.
>>
>>> No swapping?
>> Swapping is disabled.
>>
>>
>>> Bring up more xlarge instances and see if that gets you off the
>>> ground? Then work on getting your number of regions down?
>> We'll try this and get back in a couple minutes!
>>
>>
>>
>> On Fri, Dec 16, 2011 at 3:21 PM, Stack <st...@duboce.net> wrote:
>>> On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <ho...@gmail.com> wrote:
>>>> Whenever a RS is assigned a large (> 500-600) number of regions, the
>>>> heap usage grows without bound. Then the RS constantly GCs and must be
>>>> killed.
>>>>
>>>
>>> Just opening regions, it does this?
>>>
>>> No load?
>>>
>>> No swapping?
>>>
>>> What JVM and what args for JVM?
>>>
>>>
>>>> This is with 2000 regions over 3 RSs, each with a 10 GB heap. The
>>>> RSs run on EC2 xlarge instances; the master is on its own large
>>>> instance. Datanodes are colocated with the RSs, and the namenode
>>>> with the master.
>>>>
>>>> Looks like a memory leak? Any suggestions would be appreciated.
>>>>
>>>
>>> Bring up more xlarge instances and see if that gets you off the
>>> ground? Then work on getting your number of regions down?
>>>
>>> St.Ack

Re: RS memory leak?

Posted by Homer Strong <ho...@gmail.com>.
@Stack, we tried your suggestion for getting off the ground with an
extra RS. We added 1 more identical RS, and after balancing, killed
the extra one. The cluster remained stable for the night, but this
morning all 3 of our RSs had OOMs.

In the logs we find many entries like

https://gist.github.com/eadb953fcadbeb302143

Followed by the RSs aborting due to OOMs. Could this be related to
HBASE-4222?

Thanks for your help!


On Fri, Dec 16, 2011 at 3:31 PM, Homer Strong <ho...@gmail.com> wrote:
> Thanks for the response! To add to our problem description: it
> doesn't seem to be an absolute number of regions that triggers the
> memory overuse; we've seen it happen now with a wide range of region
> counts.
>
>> Just opening regions, it does this?
> Yes.
>
>> No load?
> Very low load, no requests.
>
>> No swapping?
> Swapping is disabled.
>
>
>> Bring up more xlarge instances and see if that gets you off the
>> ground? Then work on getting your number of regions down?
> We'll try this and get back in a couple minutes!
>
>
>
> On Fri, Dec 16, 2011 at 3:21 PM, Stack <st...@duboce.net> wrote:
>> On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <ho...@gmail.com> wrote:
>>> Whenever a RS is assigned a large (> 500-600) number of regions, the
>>> heap usage grows without bound. Then the RS constantly GCs and must be
>>> killed.
>>>
>>
>> Just opening regions, it does this?
>>
>> No load?
>>
>> No swapping?
>>
>> What JVM and what args for JVM?
>>
>>
>>> This is with 2000 regions over 3 RSs, each with a 10 GB heap. The
>>> RSs run on EC2 xlarge instances; the master is on its own large
>>> instance. Datanodes are colocated with the RSs, and the namenode
>>> with the master.
>>>
>>> Looks like a memory leak? Any suggestions would be appreciated.
>>>
>>
>> Bring up more xlarge instances and see if that gets you off the
>> ground? Then work on getting your number of regions down?
>>
>> St.Ack

Re: RS memory leak?

Posted by Homer Strong <ho...@gmail.com>.
Thanks for the response! To add to our problem description: it doesn't
seem to be an absolute number of regions that triggers the memory
overuse; we've seen it happen now with a wide range of region counts.

> Just opening regions, it does this?
Yes.

> No load?
Very low load, no requests.

> No swapping?
Swapping is disabled.


> Bring up more xlarge instances and see if that gets you off the
> ground? Then work on getting your number of regions down?
We'll try this and get back in a couple minutes!



On Fri, Dec 16, 2011 at 3:21 PM, Stack <st...@duboce.net> wrote:
> On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <ho...@gmail.com> wrote:
>> Whenever a RS is assigned a large (> 500-600) number of regions, the
>> heap usage grows without bound. Then the RS constantly GCs and must be
>> killed.
>>
>
> Just opening regions, it does this?
>
> No load?
>
> No swapping?
>
> What JVM and what args for JVM?
>
>
>> This is with 2000 regions over 3 RSs, each with a 10 GB heap. The
>> RSs run on EC2 xlarge instances; the master is on its own large
>> instance. Datanodes are colocated with the RSs, and the namenode
>> with the master.
>>
>> Looks like a memory leak? Any suggestions would be appreciated.
>>
>
> Bring up more xlarge instances and see if that gets you off the
> ground? Then work on getting your number of regions down?
>
> St.Ack

Re: RS memory leak?

Posted by Stack <st...@duboce.net>.
On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <ho...@gmail.com> wrote:
> Whenever a RS is assigned a large (> 500-600) number of regions, the
> heap usage grows without bound. Then the RS constantly GCs and must be
> killed.
>

Just opening regions, it does this?

No load?

No swapping?

What JVM and what args for JVM?
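
(For comparison, GC settings of the sort commonly recommended for
regionservers of this vintage go in hbase-env.sh -- a sketch only; the
values and log path below are examples, not what your cluster runs:)

    # hbase-env.sh: CMS collector plus GC logging for the regionserver
    export HBASE_REGIONSERVER_OPTS="-Xmx10g -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/gc-regionserver.log"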


> This is with 2000 regions over 3 RSs, each with a 10 GB heap. The RSs
> run on EC2 xlarge instances; the master is on its own large instance.
> Datanodes are colocated with the RSs, and the namenode with the
> master.
>
> Looks like a memory leak? Any suggestions would be appreciated.
>

Bring up more xlarge instances and see if that gets you off the ground?
Then work on getting your number of regions down?

St.Ack