Posted to user@hbase.apache.org by David chen <c7...@163.com> on 2015/05/13 04:41:56 UTC

How to know the root reason to cause RegionServer OOM?

A RegionServer was killed because of OutOfMemory (OOM). Although the killed process can be seen in the Linux message log, I still have two questions:
1. How can I inspect the root cause of the OOM?
2. When the RegionServer encounters OOM, why can't it free some of the memory it occupies? If it could, would the kill even be necessary?
Any ideas would be appreciated!

Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
Here is what I saw in hbase-hbase-regionserver-hostname.out :

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 30530"...

FYI

On Wed, May 20, 2015 at 3:46 AM, David chen <c7...@163.com> wrote:

> Thanks Ted,
> Sorry, the log extension is ".log.out", so i think the out file you said
> is log file.
> My version is HBase 0.98.6-cdh5.2.0, where is regionserver.out file?
> BTW, i should assure that my scenario is #2, so expect to get your snippet
> from .out file

Re:Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks Ted,
Sorry, my log file's extension is ".log.out", so I thought the .out file you mentioned was that log file.
My version is HBase 0.98.6-cdh5.2.0; where is the regionserver.out file?
BTW, I want to confirm that my scenario is really #2, so I'd like to see your snippet from the .out file.

Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
For scenario #1, please check the regionserver.out file, not the log file.

I was able to reproduce scenario #1 by giving the regionserver a 124 MB heap. As soon as I put load on the server, it was killed by the "kill -9" command.

I can send you a snippet from the .out file in the morning.
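
A minimal sketch of such a setup, assuming the usual conf/hbase-env.sh convention (the 124 MB figure comes from the message above; the rest is illustrative rather than the poster's actual configuration):

# conf/hbase-env.sh -- deliberately tiny heap so that real load triggers an OOME
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms124m -Xmx124m"
# with -XX:OnOutOfMemoryError="kill -9 %p" on the command line (as in the jps
# output quoted later in this thread), the kill is then recorded in the .out file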

Cheers



> On May 20, 2015, at 1:46 AM, David chen <c7...@163.com> wrote:
> 
> Thanks Ted,
> For scenario #1, can not see any clues in regionserver log file that denotes "kill -9" command was executed. Meanwhile, i think when JVM inspects regionserver process OOME, it will create a new thread to execute "kill -9 %p", the new thread should not write regionserver log, so the fact, there is not any clues in regionserver log, is normal. Right?
> For scenario #2, dmesg also did not provide any clues. But some clues were seen in /var/log/messages:
> ......
> May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java) score 497 or sacrifice child
> May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java) total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
> ......
> The 22827 above is regionserver PID.
> It looks like regionserver itself OOM(total-vm:17569220kB, anon-rss:16296276kB, the max-heap-size set is 15G), so was killed. Right?
> But hbase has no heavy load in the cluster, so i don't think it was killed because of itself OOME, instead i think because of lack of memory for other applications, so OS kill regionserver to run more applications. 
> I currently has no evidence to prove my idea, so hope more helps. Thanks.
> 
> 
> 
> 
> 
> 
> 
> 
> At 2015-05-20 10:04:19, "Ted Yu" <yu...@gmail.com> wrote:
>> For scenario #1, you would see in the regionserver.out file that "kill -9 "
>> command was applied due to OOME.
>> 
>> For scenario #2, can you see if dmesg provides some clue ?
>> 
>> Cheers
>> 
>>> On Tue, May 19, 2015 at 6:32 PM, David chen <c7...@163.com> wrote:
>>> 
>>> Thanks for guys reply, its indeed helped me.
>>> Another question, I think there are two possibilities to kill RegionServer
>>> process:
>>> 1. When JVM inspects that the memory, RegionServer has occupied, exceed
>>> the max-heap-size,  then JVM calls positively the command configured by
>>> option "-XX:OnOutOfMemoryError=kill -9 %p" to kill RegionServer  process.
>>> 2. RegionServer process does not reach the max-heap-size, but new
>>> application need to allocation memory,  if lack of memory, OS will choose
>>> to kill some processes, RegionServer unfortunately becomes the first
>>> choice, so it  is killed by OS.
>>> Is my understanding right? If so, how to know which possibility my scene
>>> is?
>>> Any ideas can be appreciated!
>>> 

Re:Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks Ted and Stack,
@Ted, I also reproduced the OOM for scenario #1 and found the hint in the out file (it is /var/run/cloudera-scm-agent/process/*-hbase-REGIONSERVER/logs/stdout.log in CDH).
@Stack, it is indeed not an HBase issue; another application occupied more memory, so the OS chose to kill the RegionServer process.
Thanks again for everyone's help.

Re: How to know the root reason to cause RegionServer OOM?

Posted by Stack <st...@duboce.net>.
On Wed, May 20, 2015 at 1:46 AM, David chen <c7...@163.com> wrote:

> Thanks Ted,
> For scenario #1, can not see any clues in regionserver log file that
> denotes "kill -9" command was executed. Meanwhile, i think when JVM
> inspects regionserver process OOME, it will create a new thread to execute
> "kill -9 %p", the new thread should not write regionserver log, so the
> fact, there is not any clues in regionserver log, is normal. Right?
> For scenario #2, dmesg also did not provide any clues. But some clues were
> seen in /var/log/messages:
> ......
> May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java)
> score 497 or sacrifice child
> May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java)
> total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
> ......
> The 22827 above is regionserver PID.
> It looks like regionserver itself OOM(total-vm:17569220kB,
> anon-rss:16296276kB, the max-heap-size set is 15G), so was killed. Right?
>

Yes.


> But hbase has no heavy load in the cluster,


Doesn't matter. You allocated it a heap of 15G. The OS is looking for
memory and is at an extreme (swapping totally disabled?), so it starts
killing random processes. This is not an HBase issue. It is an
oversubscription problem. Google how to address it.


> so i don't think it was killed because of itself OOME, instead i think
> because of lack of memory for other applications, so OS kill regionserver
> to run more applications.
> I currently has no evidence to prove my idea, so hope more helps. Thanks.
>

You quote all necessary evidence above.

St.Ack


>
>
>
>
>
>
>
> At 2015-05-20 10:04:19, "Ted Yu" <yu...@gmail.com> wrote:
> >For scenario #1, you would see in the regionserver.out file that "kill -9
> "
> >command was applied due to OOME.
> >
> >For scenario #2, can you see if dmesg provides some clue ?
> >
> >Cheers
> >
> >On Tue, May 19, 2015 at 6:32 PM, David chen <c7...@163.com> wrote:
> >
> >> Thanks for guys reply, its indeed helped me.
> >> Another question, I think there are two possibilities to kill
> RegionServer
> >> process:
> >> 1. When JVM inspects that the memory, RegionServer has occupied, exceed
> >> the max-heap-size,  then JVM calls positively the command configured by
> >> option "-XX:OnOutOfMemoryError=kill -9 %p" to kill RegionServer
> process.
> >> 2. RegionServer process does not reach the max-heap-size, but new
> >> application need to allocation memory,  if lack of memory, OS will
> choose
> >> to kill some processes, RegionServer unfortunately becomes the first
> >> choice, so it  is killed by OS.
> >> Is my understanding right? If so, how to know which possibility my scene
> >> is?
> >> Any ideas can be appreciated!
> >>
>

Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks Ted,
For scenario #1, I cannot see any clue in the regionserver log file indicating that the "kill -9" command was executed. Meanwhile, I think that when the JVM detects an OOME in the regionserver process, it creates a new thread to execute "kill -9 %p"; that thread would not write to the regionserver log, so the absence of any clue in the regionserver log is normal. Right?
For scenario #2, dmesg did not provide any clues either, but some were seen in /var/log/messages:
......
May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java) score 497 or sacrifice child
May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java) total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
......
The 22827 above is the regionserver PID.
It looks like the regionserver itself ran out of memory (total-vm:17569220kB, anon-rss:16296276kB; the max heap size set is 15G), so it was killed. Right?
But HBase has no heavy load in the cluster, so I don't think it was killed because of its own OOME; instead, I think other applications lacked memory, so the OS killed the regionserver to make room for them.
I currently have no evidence to prove this idea, so I hope for more help. Thanks.








At 2015-05-20 10:04:19, "Ted Yu" <yu...@gmail.com> wrote:
>For scenario #1, you would see in the regionserver.out file that "kill -9 "
>command was applied due to OOME.
>
>For scenario #2, can you see if dmesg provides some clue ?
>
>Cheers
>
>On Tue, May 19, 2015 at 6:32 PM, David chen <c7...@163.com> wrote:
>
>> Thanks for guys reply, its indeed helped me.
>> Another question, I think there are two possibilities to kill RegionServer
>> process:
>> 1. When JVM inspects that the memory, RegionServer has occupied, exceed
>> the max-heap-size,  then JVM calls positively the command configured by
>> option "-XX:OnOutOfMemoryError=kill -9 %p" to kill RegionServer  process.
>> 2. RegionServer process does not reach the max-heap-size, but new
>> application need to allocation memory,  if lack of memory, OS will choose
>> to kill some processes, RegionServer unfortunately becomes the first
>> choice, so it  is killed by OS.
>> Is my understanding right? If so, how to know which possibility my scene
>> is?
>> Any ideas can be appreciated!
>>

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
For scenario #1, you would see in the regionserver.out file that the "kill -9"
command was applied due to OOME.

For scenario #2, can you see if dmesg provides some clue?

Cheers

On Tue, May 19, 2015 at 6:32 PM, David chen <c7...@163.com> wrote:

> Thanks for guys reply, its indeed helped me.
> Another question, I think there are two possibilities to kill RegionServer
> process:
> 1. When JVM inspects that the memory, RegionServer has occupied, exceed
> the max-heap-size,  then JVM calls positively the command configured by
> option "-XX:OnOutOfMemoryError=kill -9 %p" to kill RegionServer  process.
> 2. RegionServer process does not reach the max-heap-size, but new
> application need to allocation memory,  if lack of memory, OS will choose
> to kill some processes, RegionServer unfortunately becomes the first
> choice, so it  is killed by OS.
> Is my understanding right? If so, how to know which possibility my scene
> is?
> Any ideas can be appreciated!
>

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks for your replies; they have indeed helped me.
Another question: I think there are two possibilities for the RegionServer process being killed:
1. The JVM detects that the memory the RegionServer occupies exceeds the max heap size, so the JVM itself invokes the command configured by the option "-XX:OnOutOfMemoryError=kill -9 %p" to kill the RegionServer process.
2. The RegionServer process has not reached the max heap size, but another application needs to allocate memory; when memory runs short, the OS chooses some processes to kill, and the RegionServer unfortunately becomes the first choice, so it is killed by the OS.
Is my understanding right? If so, how can I tell which possibility applies to my case?
Any ideas would be appreciated!

Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Sean Busbey <bu...@cloudera.com>.
On Mon, May 18, 2015 at 11:47 AM, Andrew Purtell <ap...@apache.org>
wrote:

> You need to not overcommit memory on servers running JVMs for HDFS and
> HBase (and YARN, and containers, if colocating Hadoop MR). Sum the -Xmx
> parameter, the maximum heap size, for all JVMs that will be concurrently
> executing on the server. The total should be less than the total amount of
> RAM available on the server. Additionally you will want to reserve ~1GB for
> the OS. Finally, set vm.swappiness=0 in /etc/sysctl.conf to prevent
> unnecessary paging.
>
>
On 3.5+ kernels you have to set vm.swappiness=1 if you still want the kernel
to page in order to avoid OOM kills.
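
A minimal sketch of that setting, assuming the usual sysctl workflow:

# /etc/sysctl.conf -- on newer kernels 1 still allows some paging; 0 disables it almost entirely
vm.swappiness = 1
# apply without a reboot
sysctl -p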

-- 
Sean

Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Andrew Purtell <ap...@apache.org>.
You need to not overcommit memory on servers running JVMs for HDFS and
HBase (and YARN, and containers, if colocating Hadoop MR). Sum the -Xmx
parameter, the maximum heap size, for all JVMs that will be concurrently
executing on the server. The total should be less than the total amount of
RAM available on the server. Additionally you will want to reserve ~1GB for
the OS. Finally, set vm.swappiness=0 in /etc/sysctl.conf to prevent
unnecessary paging.
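
A rough sketch of that accounting on one node (illustrative commands, not an exact audit; off-heap memory is not captured by -Xmx and has to be added by hand):

# list the max heap of every JVM running on the box
jps -v | grep -o -e '-Xmx[^ ]*'
# compare the sum of those values (plus ~1GB reserved for the OS) against physical RAM
free -m
# /etc/sysctl.conf
vm.swappiness = 0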


On Sun, May 17, 2015 at 8:08 PM, David chen <c7...@163.com> wrote:

> The snippet in /var/log/messages is as follows, i am sure that process
> killed(22827) is RegsionServer.
> ......
> May 14 12:00:38 localhost kernel: Mem-Info:
> May 14 12:00:38 localhost kernel: Node 0 DMA per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:    0, btch:   1 usd:   0
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:    0, btch:   1 usd:   0
> May 14 12:00:38 localhost kernel: Node 0 DMA32 per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:  30
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:   8
> May 14 12:00:38 localhost kernel: Node 0 Normal per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   5
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  20
> May 14 12:00:38 localhost kernel: Node 1 Normal per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   7
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  10
> May 14 12:00:38 localhost kernel: active_anon:7993118 inactive_anon:48001
> isolated_anon:0
> May 14 12:00:38 localhost kernel: active_file:855 inactive_file:960
> isolated_file:0
> May 14 12:00:38 localhost kernel: unevictable:0 dirty:0 writeback:0
> unstable:0
> May 14 12:00:38 localhost kernel: free:39239 slab_reclaimable:14043
> slab_unreclaimable:27993
> May 14 12:00:38 localhost kernel: mapped:48750 shmem:75053
> pagetables:20540 bounce:0
> May 14 12:00:38 localhost kernel: Node 0 DMA free:15732kB min:40kB
> low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15336kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 3211 16088 16088
> May 14 12:00:38 localhost kernel: Node 0 DMA32 free:60388kB min:8968kB
> low:11208kB high:13452kB active_anon:2811676kB inactive_anon:72kB
> active_file:0kB inactive_file:788kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:3288224kB mlocked:0kB dirty:0kB writeback:44kB
> mapped:156kB shmem:8232kB slab_reclaimable:10652kB
> slab_unreclaimable:5144kB kernel_stack:56kB pagetables:4252kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:1312 all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 12877 12877
> May 14 12:00:38 localhost kernel: Node 0 Normal free:35772kB min:35964kB
> low:44952kB high:53944kB active_anon:13062472kB inactive_anon:4864kB
> active_file:1268kB inactive_file:1504kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:13186560kB mlocked:0kB dirty:0kB writeback:92kB
> mapped:6172kB shmem:51928kB slab_reclaimable:22732kB
> slab_unreclaimable:73204kB kernel_stack:16240kB pagetables:38040kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10268
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
> May 14 12:00:38 localhost kernel: Node 1 Normal free:45064kB min:45132kB
> low:56412kB high:67696kB active_anon:16098324kB inactive_anon:187068kB
> active_file:2192kB inactive_file:1548kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:16547840kB mlocked:0kB dirty:116kB writeback:0kB
> mapped:188672kB shmem:240052kB slab_reclaimable:22788kB
> slab_unreclaimable:33624kB kernel_stack:7352kB pagetables:39868kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:12064
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
> May 14 12:00:38 localhost kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB
> 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15732kB
> May 14 12:00:38 localhost kernel: Node 0 DMA32: 659*4kB 576*8kB 485*16kB
> 338*32kB 208*64kB 106*128kB 27*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 60636kB
> May 14 12:00:38 localhost kernel: Node 0 Normal: 1166*4kB 579*8kB 337*16kB
> 203*32kB 106*64kB 61*128kB 3*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 37568kB
> May 14 12:00:38 localhost kernel: Node 1 Normal: 668*4kB 405*8kB 422*16kB
> 259*32kB 176*64kB 67*128kB 7*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 43608kB
> May 14 12:00:38 localhost kernel: 78257 total pagecache pages
> May 14 12:00:38 localhost kernel: 0 pages in swap cache
> May 14 12:00:38 localhost kernel: Swap cache stats: add 0, delete 0, find
> 0/0
> May 14 12:00:38 localhost kernel: Free swap  = 0kB
> May 14 12:00:38 localhost kernel: Total swap = 0kB
> May 14 12:00:38 localhost kernel: 8388607 pages RAM
> May 14 12:00:38 localhost kernel: 181753 pages reserved
> May 14 12:00:38 localhost kernel: 77957 pages shared
> May 14 12:00:38 localhost kernel: 8104642 pages non-shared
> May 14 12:00:38 localhost kernel: [ pid ]   uid  tgid total_vm      rss
> cpu oom_adj oom_score_adj name
> ......
> May 14 12:00:38 localhost kernel: [22827]   483 22827  4392305  4074129
> 23       0             0 java
> May 14 12:00:38 localhost kernel: [38727]   483 38727   428355    74385
> 22       0             0 java
> ......
> May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java)
> score 497 or sacrifice child
> May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java)
> total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
> May 14 12:00:38 localhost kernel: sleep invoked oom-killer:
> gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> May 14 12:00:38 localhost kernel: sleep cpuset=/ mems_allowed=0-1
> May 14 12:00:38 localhost kernel: Pid: 31136, comm: sleep Not tainted
> 2.6.32-358.el6.x86_64 #1
> May 14 12:00:38 localhost kernel: Call Trace:
> May 14 12:00:38 localhost kernel: [<ffffffff810cb5d1>] ?
> cpuset_print_task_mems_allowed+0x91/0xb0
> May 14 12:00:38 localhost kernel: [<ffffffff8111cd10>] ?
> dump_header+0x90/0x1b0
> May 14 12:00:38 localhost kernel: [<ffffffff810e91ee>] ?
> __delayacct_freepages_end+0x2e/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8121d0bc>] ?
> security_real_capable_noaudit+0x3c/0x70
> May 14 12:00:38 localhost kernel: [<ffffffff8111d192>] ?
> oom_kill_process+0x82/0x2a0
> May 14 12:00:38 localhost kernel: [<ffffffff8111d0d1>] ?
> select_bad_process+0xe1/0x120
> May 14 12:00:38 localhost kernel: [<ffffffff8111d5d0>] ?
> out_of_memory+0x220/0x3c0
> May 14 12:00:38 localhost kernel: [<ffffffff8112c27c>] ?
> __alloc_pages_nodemask+0x8ac/0x8d0
> May 14 12:00:38 localhost kernel: [<ffffffff8116087a>] ?
> alloc_pages_current+0xaa/0x110
> May 14 12:00:38 localhost kernel: [<ffffffff8111a0f7>] ?
> __page_cache_alloc+0x87/0x90
> May 14 12:00:38 localhost kernel: [<ffffffff81119ade>] ?
> find_get_page+0x1e/0xa0
> May 14 12:00:38 localhost kernel: [<ffffffff8111b0b7>] ?
> filemap_fault+0x1a7/0x500
> May 14 12:00:38 localhost kernel: [<ffffffff811430b4>] ?
> __do_fault+0x54/0x530
> May 14 12:00:38 localhost kernel: [<ffffffff81059784>] ?
> find_busiest_group+0x244/0x9f0
> May 14 12:00:38 localhost kernel: [<ffffffff81143687>] ?
> handle_pte_fault+0xf7/0xb50
> May 14 12:00:38 localhost kernel: [<ffffffff8105e203>] ?
> perf_event_task_sched_out+0x33/0x80
> May 14 12:00:38 localhost kernel: [<ffffffff8114431a>] ?
> handle_mm_fault+0x23a/0x310
> May 14 12:00:38 localhost kernel: [<ffffffff810474c9>] ?
> __do_page_fault+0x139/0x480
> May 14 12:00:38 localhost kernel: [<ffffffff8109be2f>] ?
> hrtimer_try_to_cancel+0x3f/0xd0
> May 14 12:00:38 localhost kernel: [<ffffffff8109bee2>] ?
> hrtimer_cancel+0x22/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8150f1b3>] ?
> do_nanosleep+0x93/0xc0
> May 14 12:00:38 localhost kernel: [<ffffffff8109bfb4>] ?
> hrtimer_nanosleep+0xc4/0x180
> May 14 12:00:38 localhost kernel: [<ffffffff8109ae00>] ?
> hrtimer_wakeup+0x0/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8151311e>] ?
> do_page_fault+0x3e/0xa0
> May 14 12:00:38 localhost kernel: [<ffffffff815104d5>] ?
> page_fault+0x25/0x30
> ......
>
>
>
>
>
>
>
>
>
>
> At 2015-05-16 02:39:02, "iain wright" <ia...@gmail.com> wrote:
> >What log is this seen in? Can you paste the log line? Do you mean
> >/var/log/messages?
> >On May 12, 2015 7:44 PM, "David chen" <c7...@163.com> wrote:
> >
> >> A RegionServer was killed because OutOfMemory(OOM), although  the
> process
> >> killed can be seen in the Linux message log, but i still have two
> following
> >> problems:
> >> 1. How to inspect the root reason to cause OOM?
> >> 2  When RegionServer encounters OOM, why can't it free some memories
> >> occupied? if so, whether or not killer will not need.
> >> Any ideas can be appreciated!
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re:Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
The snippet from /var/log/messages is as follows; I am sure that the killed process (22827) is the RegionServer.
......
May 14 12:00:38 localhost kernel: Mem-Info:
May 14 12:00:38 localhost kernel: Node 0 DMA per-cpu:
May 14 12:00:38 localhost kernel: CPU    0: hi:    0, btch:   1 usd:   0
......
May 14 12:00:38 localhost kernel: CPU   39: hi:    0, btch:   1 usd:   0
May 14 12:00:38 localhost kernel: Node 0 DMA32 per-cpu:
May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:  30
......
May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:   8
May 14 12:00:38 localhost kernel: Node 0 Normal per-cpu:
May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   5
......
May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  20
May 14 12:00:38 localhost kernel: Node 1 Normal per-cpu:
May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   7
......
May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  10
May 14 12:00:38 localhost kernel: active_anon:7993118 inactive_anon:48001 isolated_anon:0
May 14 12:00:38 localhost kernel: active_file:855 inactive_file:960 isolated_file:0
May 14 12:00:38 localhost kernel: unevictable:0 dirty:0 writeback:0 unstable:0
May 14 12:00:38 localhost kernel: free:39239 slab_reclaimable:14043 slab_unreclaimable:27993
May 14 12:00:38 localhost kernel: mapped:48750 shmem:75053 pagetables:20540 bounce:0
May 14 12:00:38 localhost kernel: Node 0 DMA free:15732kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15336kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 3211 16088 16088
May 14 12:00:38 localhost kernel: Node 0 DMA32 free:60388kB min:8968kB low:11208kB high:13452kB active_anon:2811676kB inactive_anon:72kB active_file:0kB inactive_file:788kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3288224kB mlocked:0kB dirty:0kB writeback:44kB mapped:156kB shmem:8232kB slab_reclaimable:10652kB slab_unreclaimable:5144kB kernel_stack:56kB pagetables:4252kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1312 all_unreclaimable? yes
May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 12877 12877
May 14 12:00:38 localhost kernel: Node 0 Normal free:35772kB min:35964kB low:44952kB high:53944kB active_anon:13062472kB inactive_anon:4864kB active_file:1268kB inactive_file:1504kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:0kB dirty:0kB writeback:92kB mapped:6172kB shmem:51928kB slab_reclaimable:22732kB slab_unreclaimable:73204kB kernel_stack:16240kB pagetables:38040kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10268 all_unreclaimable? yes
May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
May 14 12:00:38 localhost kernel: Node 1 Normal free:45064kB min:45132kB low:56412kB high:67696kB active_anon:16098324kB inactive_anon:187068kB active_file:2192kB inactive_file:1548kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:0kB dirty:116kB writeback:0kB mapped:188672kB shmem:240052kB slab_reclaimable:22788kB slab_unreclaimable:33624kB kernel_stack:7352kB pagetables:39868kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:12064 all_unreclaimable? yes
May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
May 14 12:00:38 localhost kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15732kB
May 14 12:00:38 localhost kernel: Node 0 DMA32: 659*4kB 576*8kB 485*16kB 338*32kB 208*64kB 106*128kB 27*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB = 60636kB
May 14 12:00:38 localhost kernel: Node 0 Normal: 1166*4kB 579*8kB 337*16kB 203*32kB 106*64kB 61*128kB 3*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB = 37568kB
May 14 12:00:38 localhost kernel: Node 1 Normal: 668*4kB 405*8kB 422*16kB 259*32kB 176*64kB 67*128kB 7*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB = 43608kB
May 14 12:00:38 localhost kernel: 78257 total pagecache pages
May 14 12:00:38 localhost kernel: 0 pages in swap cache
May 14 12:00:38 localhost kernel: Swap cache stats: add 0, delete 0, find 0/0
May 14 12:00:38 localhost kernel: Free swap  = 0kB
May 14 12:00:38 localhost kernel: Total swap = 0kB
May 14 12:00:38 localhost kernel: 8388607 pages RAM
May 14 12:00:38 localhost kernel: 181753 pages reserved
May 14 12:00:38 localhost kernel: 77957 pages shared
May 14 12:00:38 localhost kernel: 8104642 pages non-shared
May 14 12:00:38 localhost kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
......
May 14 12:00:38 localhost kernel: [22827]   483 22827  4392305  4074129  23       0             0 java
May 14 12:00:38 localhost kernel: [38727]   483 38727   428355    74385  22       0             0 java
......
May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java) score 497 or sacrifice child
May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java) total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
May 14 12:00:38 localhost kernel: sleep invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
May 14 12:00:38 localhost kernel: sleep cpuset=/ mems_allowed=0-1
May 14 12:00:38 localhost kernel: Pid: 31136, comm: sleep Not tainted 2.6.32-358.el6.x86_64 #1
May 14 12:00:38 localhost kernel: Call Trace:
May 14 12:00:38 localhost kernel: [<ffffffff810cb5d1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
May 14 12:00:38 localhost kernel: [<ffffffff8111cd10>] ? dump_header+0x90/0x1b0
May 14 12:00:38 localhost kernel: [<ffffffff810e91ee>] ? __delayacct_freepages_end+0x2e/0x30
May 14 12:00:38 localhost kernel: [<ffffffff8121d0bc>] ? security_real_capable_noaudit+0x3c/0x70
May 14 12:00:38 localhost kernel: [<ffffffff8111d192>] ? oom_kill_process+0x82/0x2a0
May 14 12:00:38 localhost kernel: [<ffffffff8111d0d1>] ? select_bad_process+0xe1/0x120
May 14 12:00:38 localhost kernel: [<ffffffff8111d5d0>] ? out_of_memory+0x220/0x3c0
May 14 12:00:38 localhost kernel: [<ffffffff8112c27c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
May 14 12:00:38 localhost kernel: [<ffffffff8116087a>] ? alloc_pages_current+0xaa/0x110
May 14 12:00:38 localhost kernel: [<ffffffff8111a0f7>] ? __page_cache_alloc+0x87/0x90
May 14 12:00:38 localhost kernel: [<ffffffff81119ade>] ? find_get_page+0x1e/0xa0
May 14 12:00:38 localhost kernel: [<ffffffff8111b0b7>] ? filemap_fault+0x1a7/0x500
May 14 12:00:38 localhost kernel: [<ffffffff811430b4>] ? __do_fault+0x54/0x530
May 14 12:00:38 localhost kernel: [<ffffffff81059784>] ? find_busiest_group+0x244/0x9f0
May 14 12:00:38 localhost kernel: [<ffffffff81143687>] ? handle_pte_fault+0xf7/0xb50
May 14 12:00:38 localhost kernel: [<ffffffff8105e203>] ? perf_event_task_sched_out+0x33/0x80
May 14 12:00:38 localhost kernel: [<ffffffff8114431a>] ? handle_mm_fault+0x23a/0x310
May 14 12:00:38 localhost kernel: [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
May 14 12:00:38 localhost kernel: [<ffffffff8109be2f>] ? hrtimer_try_to_cancel+0x3f/0xd0
May 14 12:00:38 localhost kernel: [<ffffffff8109bee2>] ? hrtimer_cancel+0x22/0x30
May 14 12:00:38 localhost kernel: [<ffffffff8150f1b3>] ? do_nanosleep+0x93/0xc0
May 14 12:00:38 localhost kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
May 14 12:00:38 localhost kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
May 14 12:00:38 localhost kernel: [<ffffffff8151311e>] ? do_page_fault+0x3e/0xa0
May 14 12:00:38 localhost kernel: [<ffffffff815104d5>] ? page_fault+0x25/0x30
......










At 2015-05-16 02:39:02, "iain wright" <ia...@gmail.com> wrote:
>What log is this seen in? Can you paste the log line? Do you mean
>/var/log/messages?
>On May 12, 2015 7:44 PM, "David chen" <c7...@163.com> wrote:
>
>> A RegionServer was killed because OutOfMemory(OOM), although  the process
>> killed can be seen in the Linux message log, but i still have two following
>> problems:
>> 1. How to inspect the root reason to cause OOM?
>> 2  When RegionServer encounters OOM, why can't it free some memories
>> occupied? if so, whether or not killer will not need.
>> Any ideas can be appreciated!

Re: How to know the root reason to cause RegionServer OOM?

Posted by iain wright <ia...@gmail.com>.
What log is this seen in? Can you paste the log line? Do you mean
/var/log/messages?
On May 12, 2015 7:44 PM, "David chen" <c7...@163.com> wrote:

> A RegionServer was killed because OutOfMemory(OOM), although  the process
> killed can be seen in the Linux message log, but i still have two following
> problems:
> 1. How to inspect the root reason to cause OOM?
> 2  When RegionServer encounters OOM, why can't it free some memories
> occupied? if so, whether or not killer will not need.
> Any ideas can be appreciated!

Re: How to know the root reason to cause RegionServer OOM?

Posted by Bryan Beaudreault <bb...@hubspot.com>.
After moving to the G1GC we were plagued with random OOMs from time to
time.  We always thought it was due to people requesting a big row or group
of rows, but upon investigation we noticed that the heap dumps were many GBs
smaller than the max heap at the time of the OOM.  If you have this symptom,
you may be running into humongous allocation issues.

I think HBase is especially prone to humongous allocations if you are
batching Puts on the client side, or have large cells.  Googling for
humongous allocations will return a lot of useful results.  I found
http://www.infoq.com/articles/tuning-tips-G1-GC to be especially helpful.

The bottom line is this:

- If an allocation is larger than 50% of the G1 region size, it is a
humongous allocation, which is more expensive to clean up.  We want to avoid
this.
- The default region size is only a few mb, so any big batch puts or scans
can easily be considered humongous.  If you don't set Xms, it will be even
smaller.
- Make sure you are setting Xms to the same value as Xmx.  This is used by
the G1 to calculate default region sizes.
- Enable -XX:+PrintAdaptiveSizePolicy, which will print out information you
can use for debugging humongous allocations.  Any time an allocation is
considered humongous, it will print the size of the allocation.  For us,
enabling this setting made it immediately obvious there was an issue.
- Using the output of the above, determine your optimal region size.
Region sizes must be a power of 2, and you should generally target around
2000 regions.  So a compromise is sometimes needed, as you don't want to be
*too* far below this number.
- Use -XX:G1HeapRegionSize=xM to set the region size.  Like I said, use a
power of 2.

For us, we were getting a lot of allocations around 3-5 MB.  The largest
share were between 3 MB and just under 4 MB.  On our 25GB regionservers, we
set the region size to 8 MB, so that the vast majority of allocations
fell under 50% of 8 MB.  The remaining humongous allocations were low enough
in volume to work fine.  On our 32GB regionservers, we set this to 16 MB and
completely eliminated humongous allocations.
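
A minimal sketch of what that tuning might look like in conf/hbase-env.sh, using the 32 GB / 16 MB numbers above (the flag set and variable placement are illustrative, not the poster's exact configuration; keep whatever other GC options your deployment already relies on):

# conf/hbase-env.sh -- G1 with an explicit region size so multi-MB batch puts
# are no longer humongous (i.e. larger than half a region);
# -Xms equals -Xmx so G1 sizes regions from the real heap, and
# -XX:+PrintAdaptiveSizePolicy logs every humongous allocation and its size
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms32g -Xmx32g -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:+PrintAdaptiveSizePolicy"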

Since the above tuning, G1GC has worked great for us and we have not had
any OOMs in a couple months.

Hope this helps.

On Wed, May 13, 2015 at 10:37 AM, Stack <st...@duboce.net> wrote:

> On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:
>
> > A RegionServer was killed because OutOfMemory(OOM), although  the process
> > killed can be seen in the Linux message log, but i still have two
> following
> > problems:
> > 1. How to inspect the root reason to cause OOM?
> >
>
> Start the regionserver with -XX:+HeapDumpOnOutOfMemoryError specifying a
> location for the heap to be dumped to on OOME (See
>
> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
> ).
> Remove the XX:OnOutOfMemoryError because now it will conflict with
> HeapDumpOnOutOfMemoryError
>  Then open the heap dump in the java mission control, jprofiler, etc., to
> see how the retained objects are associated.
>
>
> > 2  When RegionServer encounters OOM, why can't it free some memories
> > occupied? if so, whether or not killer will not need.
> >
>
> We require a certain amount of memory to process a particular work load. If
> an insufficient allocation, we OOME. Once an application has OOME'd, its
> state goes indeterminate. We opt to kill the process rather than hang
> around in a damaged state.
>
> Enable GC logging to figure why in particular you OOME'd (There are
> different categories of OOME [1]). We may have a sufficient memory
> allocation but an incorrectly tuned GC or a badly specified set of heap
> args may bring on OOME.
>
> St.Ack
> 1.
>
> http://www.javacodegeeks.com/2013/08/understanding-the-outofmemoryerror.html
>
>
> > Any ideas can be appreciated!
>

Re: How to know the root reason to cause RegionServer OOM?

Posted by Stack <st...@duboce.net>.
On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:

> A RegionServer was killed because OutOfMemory(OOM), although  the process
> killed can be seen in the Linux message log, but i still have two following
> problems:
> 1. How to inspect the root reason to cause OOM?
>

Start the regionserver with -XX:+HeapDumpOnOutOfMemoryError, specifying a
location for the heap to be dumped to on OOME (see
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html).
Remove the -XX:OnOutOfMemoryError setting because it will conflict with
HeapDumpOnOutOfMemoryError. Then open the heap dump in Java Mission Control,
JProfiler, etc., to see how the retained objects are associated.
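
A minimal sketch of those two changes in conf/hbase-env.sh (the dump path and variable placement are illustrative):

# conf/hbase-env.sh -- dump the heap on OOME so it can be inspected afterwards
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hbase/regionserver.hprof"
# ...and drop any existing -XX:OnOutOfMemoryError setting from the same options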


> 2  When RegionServer encounters OOM, why can't it free some memories
> occupied? if so, whether or not killer will not need.
>

We require a certain amount of memory to process a particular workload. If
the allocation is insufficient, we OOME. Once an application has OOME'd, its
state goes indeterminate. We opt to kill the process rather than hang
around in a damaged state.

Enable GC logging to figure out why in particular you OOME'd (there are
different categories of OOME [1]). We may have a sufficient memory
allocation but an incorrectly tuned GC or a badly specified set of heap
args may bring on OOME.
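
A minimal sketch of GC logging flags for the JDK 7/8-era JVMs in this thread (the log path is illustrative):

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"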

St.Ack
1.
http://www.javacodegeeks.com/2013/08/understanding-the-outofmemoryerror.html


> Any ideas can be appreciated!

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
I got '502 Bad Gateway' trying to access the post David mentioned.

Here is the same article in case you get 502 error:
http://java.dzone.com/articles/OOM-relation-to-swappiness

FYI

On Thu, May 14, 2015 at 2:40 AM, David chen <c7...@163.com> wrote:

> Thanks for guys' helps.
> Maybe the root reason is to turn off swap.
> The cluster contains seven Region servers, although all set vm.swappiness
> to 0, but two of them has always turned off swap, others turned on.
> Meanwhile OOM also always encountered on the two machines.
> I plan to turn on swap and also set vm.swappiness to 1, the latter is
> because of the post(http://www.innomysql.net/article/2790.html).
> Any ideas?

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks for everyone's help.
Maybe the root cause is that swap was turned off.
The cluster contains seven RegionServers; although all of them set vm.swappiness to 0, two of them have always had swap turned off while the others have it turned on. Meanwhile, the OOM has always been encountered on those two machines.
I plan to turn swap on and also set vm.swappiness to 1; the latter is because of the post (http://www.innomysql.net/article/2790.html).
Any ideas?

Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Elliott Clark <ec...@apache.org>.
On Wed, May 13, 2015 at 12:59 AM, David chen <c7...@163.com> wrote:

> -XX:MaxGCPauseMillis=6000


With this line you're basically telling Java to never garbage collect. Can
you try lowering it to something closer to the JVM default and see whether
you get better stability?
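
For reference, G1's default pause target is 200 ms, so the lowered setting would look roughly like this (placement in the options line depends on your deployment):

# replace -XX:MaxGCPauseMillis=6000 in the regionserver options with
-XX:MaxGCPauseMillis=200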

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
I should have mentioned in my previous email that I was looking at code in
branch-1.

bq. why the fix version is 1.1.0 in HBASE-11544?
See release note:
Incompatible Change: The return type of InternalScanners#next and
RegionScanners#nextRaw has been changed to NextState from boolean

Cheers

On Fri, May 15, 2015 at 3:06 AM, David chen <c7...@163.com> wrote:

> Hi Ted,
> I read the code snippet, you provided HRegionServer#Scan, in 0.98.5
> version, it looks like that the partial row is returned.
> If so, the partial row has been fixed in 0.98.5 version, why the fix
> version is 1.1.0 in HBASE-11544?
>
>
> At 2015-05-14 01:04:35, "Ted Yu" <yu...@gmail.com> wrote:
> >For #2, partial row would be returned.
> >
> >Please take a look at the following method in RSRpcServices around line
> >2393 :
> >
> >  public ScanResponse scan(final RpcController controller, final
> >ScanRequest request)
> >
> >Cheers
> >
> >On Wed, May 13, 2015 at 12:59 AM, David chen <c7...@163.com> wrote:
> >
> >> Thanks for you reply.
> >> Yes, it indeed appeared in the RegionServer command as follows:
> >> jps -v|grep "Region"
> >> HRegionServer -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p
> >> -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360
> >> -XX:+UseG1GC -XX:MaxGCPauseMillis=6000
> >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
> >>
> >>
> >> After read HBASE-11544, i have some doubts:
> >> 1. Assume scan has set caching to 1 and batch to 1, for a row with 2
> >> cells, the first RPC should only return a cell of the row, it is also the
> >> partial of a row. Unless the cell is too large size, otherwise, will not
> >> need HBASE-11544. right?
> >> 2. Assume scan has set caching to 1 and maxResultSize to 1, for a row
> >> which per cell size is more than 1, will the first RPC return the whole or
> >> partial row? I think the whole row, right?
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2015-05-13 11:04:04, "Ted Yu" <yu...@gmail.com> wrote:
> >> >Does the following appear in the command which launched region server ?
> >> >-XX:OnOutOfMemoryError="kill -9 %p"
> >> >
> >> >There could be multiple reasons for region server process to encounter
> >> OOME.
> >> >Please take a look at HBASE-11544 which fixes a common cause. The fix is
> >> in
> >> >the upcoming 1.1.0 release.
> >> >
> >> >Cheers
> >> >
> >> >On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:
> >> >
> >> >> A RegionServer was killed because OutOfMemory(OOM), although  the
> >> process
> >> >> killed can be seen in the Linux message log, but i still have two
> >> following
> >> >> problems:
> >> >> 1. How to inspect the root reason to cause OOM?
> >> >> 2  When RegionServer encounters OOM, why can't it free some memories
> >> >> occupied? if so, whether or not killer will not need.
> >> >> Any ideas can be appreciated!
> >>
>
>

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Hi Ted,
I read the code snippet you pointed me to (HRegionServer#scan in version 0.98.5), and it looks like the partial row is returned there as well.
If so, and the partial-row behavior was already handled in 0.98.5, why is the fix version of HBASE-11544 1.1.0?

At 2015-05-14 01:04:35, "Ted Yu" <yu...@gmail.com> wrote:
>For #2, partial row would be returned.
>
>Please take a look at the following method in RSRpcServices around line
>2393 :
>
>  public ScanResponse scan(final RpcController controller, final
>ScanRequest request)
>
>Cheers
>
>On Wed, May 13, 2015 at 12:59 AM, David chen <c7...@163.com> wrote:
>
>> Thanks for you reply.
>> Yes, it indeed appeared in the RegionServer command as follows:
>> jps -v|grep "Region"
>> HRegionServer -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p
>> -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360
>> -XX:+UseG1GC -XX:MaxGCPauseMillis=6000
>> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>>
>>
>> After read HBASE-11544, i have some doubts:
>> 1. Assume scan has set caching to 1 and batch to 1, for a row with 2
>> cells, the first RPC should only return a cell of the row, it is also the
>> partial of a row. Unless the cell is too large size, otherwise, will not
>> need HBASE-11544. right?
>> 2. Assume scan has set caching to 1 and maxResultSize to 1, for a row
>> which per cell size is more than 1, will the first RPC return the whole or
>> partial row? I think the whole row, right?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2015-05-13 11:04:04, "Ted Yu" <yu...@gmail.com> wrote:
>> >Does the following appear in the command which launched region server ?
>> >-XX:OnOutOfMemoryError="kill -9 %p"
>> >
>> >There could be multiple reasons for region server process to encounter
>> OOME.
>> >Please take a look at HBASE-11544 which fixes a common cause. The fix is
>> in
>> >the upcoming 1.1.0 release.
>> >
>> >Cheers
>> >
>> >On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:
>> >
>> >> A RegionServer was killed because OutOfMemory(OOM), although  the
>> process
>> >> killed can be seen in the Linux message log, but i still have two
>> following
>> >> problems:
>> >> 1. How to inspect the root reason to cause OOM?
>> >> 2  When RegionServer encounters OOM, why can't it free some memories
>> >> occupied? if so, whether or not killer will not need.
>> >> Any ideas can be appreciated!
>>

Re: Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
For #2, partial row would be returned.

Please take a look at the following method in RSRpcServices around line
2393 :

  public ScanResponse scan(final RpcController controller, final
ScanRequest request)

Cheers

On Wed, May 13, 2015 at 12:59 AM, David chen <c7...@163.com> wrote:

> Thanks for you reply.
> Yes, it indeed appeared in the RegionServer command as follows:
> jps -v|grep "Region"
> HRegionServer -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p
> -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360
> -XX:+UseG1GC -XX:MaxGCPauseMillis=6000
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>
>
> After read HBASE-11544, i have some doubts:
> 1. Assume scan has set caching to 1 and batch to 1, for a row with 2
> cells, the first RPC should only return a cell of the row, it is also the
> partial of a row. Unless the cell is too large size, otherwise, will not
> need HBASE-11544. right?
> 2. Assume scan has set caching to 1 and maxResultSize to 1, for a row
> which per cell size is more than 1, will the first RPC return the whole or
> partial row? I think the whole row, right?
>
>
>
>
>
>
>
>
>
>
> At 2015-05-13 11:04:04, "Ted Yu" <yu...@gmail.com> wrote:
> >Does the following appear in the command which launched region server ?
> >-XX:OnOutOfMemoryError="kill -9 %p"
> >
> >There could be multiple reasons for region server process to encounter
> OOME.
> >Please take a look at HBASE-11544 which fixes a common cause. The fix is
> in
> >the upcoming 1.1.0 release.
> >
> >Cheers
> >
> >On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:
> >
> >> A RegionServer was killed because OutOfMemory(OOM), although  the
> process
> >> killed can be seen in the Linux message log, but i still have two
> following
> >> problems:
> >> 1. How to inspect the root reason to cause OOM?
> >> 2  When RegionServer encounters OOM, why can't it free some memories
> >> occupied? if so, whether or not killer will not need.
> >> Any ideas can be appreciated!
>

Re:Re: How to know the root reason to cause RegionServer OOM?

Posted by David chen <c7...@163.com>.
Thanks for your reply.
Yes, it indeed appears in the RegionServer command line, as follows:
jps -v|grep "Region"
HRegionServer -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360 -XX:+UseG1GC -XX:MaxGCPauseMillis=6000 -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh


After reading HBASE-11544, I have some doubts:
1. Assume a scan has set caching to 1 and batch to 1; for a row with 2 cells, the first RPC should return only one cell of the row, which is also a partial row. So unless the cell itself is very large, HBASE-11544 would not be needed, right?
2. Assume a scan has set caching to 1 and maxResultSize to 1; for a row whose every cell is larger than 1 byte, will the first RPC return the whole row or a partial row? I think the whole row, right?










At 2015-05-13 11:04:04, "Ted Yu" <yu...@gmail.com> wrote:
>Does the following appear in the command which launched region server ?
>-XX:OnOutOfMemoryError="kill -9 %p"
>
>There could be multiple reasons for region server process to encounter OOME.
>Please take a look at HBASE-11544 which fixes a common cause. The fix is in
>the upcoming 1.1.0 release.
>
>Cheers
>
>On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:
>
>> A RegionServer was killed because OutOfMemory(OOM), although  the process
>> killed can be seen in the Linux message log, but i still have two following
>> problems:
>> 1. How to inspect the root reason to cause OOM?
>> 2  When RegionServer encounters OOM, why can't it free some memories
>> occupied? if so, whether or not killer will not need.
>> Any ideas can be appreciated!

Re: How to know the root reason to cause RegionServer OOM?

Posted by Ted Yu <yu...@gmail.com>.
Does the following appear in the command that launched the region server?
-XX:OnOutOfMemoryError="kill -9 %p"

There could be multiple reasons for the region server process to encounter OOME.
Please take a look at HBASE-11544, which fixes a common cause. The fix is in
the upcoming 1.1.0 release.

Cheers

On Tue, May 12, 2015 at 7:41 PM, David chen <c7...@163.com> wrote:

> A RegionServer was killed because OutOfMemory(OOM), although  the process
> killed can be seen in the Linux message log, but i still have two following
> problems:
> 1. How to inspect the root reason to cause OOM?
> 2  When RegionServer encounters OOM, why can't it free some memories
> occupied? if so, whether or not killer will not need.
> Any ideas can be appreciated!