Posted to user@cassandra.apache.org by Satoshi Hikida <sa...@imagine-orb.com> on 2016/04/20 05:19:16 UTC

Re: Cassandra causing OOM Killer to strike on new cluster running 3.4

Hi,

I'm looking for a patch (file) for this bug (CASSANDRA-11344) that I can
apply to C* version 2.2.5. Is there a patch available for that version? I
checked the link (https://issues.apache.org/jira/browse/CASSANDRA-11344) but
couldn't find a patch file or anything similar. Or is there any workaround
to avoid this bug with version 2.2.5?

Regards,
Satoshi

On Tue, Mar 15, 2016 at 4:47 AM, Adam Plumb <ap...@fiksu.com> wrote:

> OK so good news, I'm running with the patched jar file in my cluster and
> haven't seen any issues.  The bloom filter off-heap memory usage is between
> 1.5GB and 2GB per node, which is much more in-line with what I'm expecting!
>  (thumbsup)
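>
> For reference, a rough way to check that per node (just a sketch -- the exact
> label names in the output vary a bit between versions):
>
>   # node-wide off-heap total (bloom filters, index summaries, etc.)
>   nodetool info | grep -i "off heap"
>   # per-table bloom filter figures for the keyspace
>   nodetool cfstats abc | grep -i "bloom filter"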
>
> On Mon, Mar 14, 2016 at 9:42 AM, Adam Plumb <ap...@fiksu.com> wrote:
>
>> Thanks for the link!  Luckily the cluster I'm running is not yet in
>> production and running with dummy data so I will throw that jar on the
>> nodes and I'll let you know how things shake out.
>>
>> On Sun, Mar 13, 2016 at 11:02 PM, Paulo Motta <pa...@gmail.com>
>> wrote:
>>
>>> You could be hitting CASSANDRA-11344 (
>>> https://issues.apache.org/jira/browse/CASSANDRA-11344).  If that's the
>>> case, you may try to replace your cassandra jar on an affected node with a
>>> version with this fix in place and force bloom filter regeneration to see
>>> if it fixes your problem. You can build with "ant jar" from this branch:
>>> https://github.com/pauloricardomg/cassandra/tree/3.4-11344
>>>
>>> You can force bloom filter regeneration by either removing your
>>> *Filter.db files (make sure to back them up first for safety) or changing
>>> the bloom_filter_fp_chance before restarting affected nodes with the fixed
>>> jar.
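>>>
>>> For example (just a sketch, assuming Cassandra lives under /usr/local/cassandra
>>> with data in /usr/local/cassandra/data and is managed as a service -- adjust
>>> paths, jar names and service commands to your setup):
>>>
>>>   # build a patched jar from the branch above
>>>   git clone -b 3.4-11344 https://github.com/pauloricardomg/cassandra.git
>>>   cd cassandra && ant jar
>>>
>>>   # on each affected node: stop Cassandra and swap in the patched jar
>>>   # (the exact jar name produced under build/ may differ)
>>>   sudo service cassandra stop
>>>   cp /usr/local/cassandra/lib/apache-cassandra-3.4.jar ~/apache-cassandra-3.4.jar.bak
>>>   cp build/apache-cassandra-3.4*.jar /usr/local/cassandra/lib/apache-cassandra-3.4.jar
>>>
>>>   # force bloom filter regeneration: back up and remove the *Filter.db files...
>>>   mkdir -p /tmp/filter-backup
>>>   find /usr/local/cassandra/data/abc -name '*Filter.db' -exec cp --parents {} /tmp/filter-backup/ \;
>>>   find /usr/local/cassandra/data/abc -name '*Filter.db' -delete
>>>   # ...or instead change bloom_filter_fp_chance before the restart, e.g.:
>>>   # cqlsh -e "ALTER TABLE abc.xyz WITH bloom_filter_fp_chance = 0.01;"
>>>
>>>   sudo service cassandra start
>>>
>>> (Overwriting the existing apache-cassandra-3.4.jar name keeps the startup
>>> classpath unchanged.)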
>>>
>>> 2016-03-13 19:51 GMT-03:00 Adam Plumb <ap...@fiksu.com>:
>>>
>>>> So it's looking like the bloom filter off heap memory usage is ramping
>>>> up and up until the OOM killer kills the java process.  I relaunched on
>>>> instances with 60GB of memory and the same thing is happening.  A node will
>>>> start using more and more RAM until the process is killed, then another
>>>> node will start using more and more until it is also killed.
>>>>
>>>> Is this the expected behavior?  It doesn't seem ideal to me.  Is there
>>>> anything obvious that I'm doing wrong?
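>>>>
>>>> (For what it's worth, here's roughly how I'm watching it -- just a sketch:)
>>>>
>>>>   # Cassandra process RSS (kB) next to the node's reported off-heap usage, every minute
>>>>   watch -n 60 'ps -o rss= -p $(pgrep -f CassandraDaemon); nodetool info | grep -i "off heap"'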
>>>>
>>>> On Fri, Mar 11, 2016 at 11:31 AM, Adam Plumb <ap...@fiksu.com> wrote:
>>>>
>>>>> Here is the creation syntax for the entire schema.  The xyz table has
>>>>> about 2.1 billion keys and the def table has about 230 million keys.  Max
>>>>> row size is about 3KB, mean row size is 700B.
>>>>>
>>>>> CREATE KEYSPACE abc WITH replication = {'class':
>>>>>> 'NetworkTopologyStrategy', 'us-east': 3};
>>>>>> CREATE TABLE xyz (
>>>>>>   id text,
>>>>>>   secondary_id int,
>>>>>>   data text,
>>>>>>   PRIMARY KEY(id)
>>>>>> )
>>>>>>   WITH
>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>> CREATE INDEX secondary_id_index ON abc.xyz (secondary_id);
>>>>>> CREATE TABLE def (
>>>>>>   id text,
>>>>>>   secondary_id int,
>>>>>>   data text,
>>>>>>   PRIMARY KEY(id)
>>>>>> )
>>>>>>   WITH
>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>> CREATE INDEX secondary_id_index_def ON abc.def (secondary_id);
>>>>>
>>>>>
>>>>> On Fri, Mar 11, 2016 at 11:24 AM, Jack Krupansky <
>>>>> jack.krupansky@gmail.com> wrote:
>>>>>
>>>>>> What is your schema and data like - in particular, how wide are your
>>>>>> partitions (number of rows and typical row size)?
>>>>>>
>>>>>> Maybe you just need (a lot) more heap for rows during the repair
>>>>>> process.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Fri, Mar 11, 2016 at 11:19 AM, Adam Plumb <ap...@fiksu.com>
>>>>>> wrote:
>>>>>>
>>>>>>> These are brand new boxes only running Cassandra.  Yeah the kernel
>>>>>>> is what is killing the JVM, and this does appear to be a memory leak in
>>>>>>> Cassandra.  And Cassandra is the only thing running, aside from the basic
>>>>>>> services needed for Amazon Linux to run.
>>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 11:17 AM, Sebastian Estevez <
>>>>>>> sebastian.estevez@datastax.com> wrote:
>>>>>>>
>>>>>>>> Sacrifice child in dmesg is your OS killing the process with the
>>>>>>>> most ram. That means you're actually running out of memory at the Linux
>>>>>>>> level outside of the JVM.
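>>>>>>>>
>>>>>>>> A quick sketch of how to confirm that on a node (plain Linux tooling):
>>>>>>>>
>>>>>>>>   dmesg | grep -iE "out of memory|oom-killer"   # kernel OOM killer activity
>>>>>>>>   free -m                                       # overall memory picture
>>>>>>>>   ps aux --sort=-rss | head                     # biggest consumers by RSS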
>>>>>>>>
>>>>>>>> Are you running anything other than Cassandra on this box?
>>>>>>>>
>>>>>>>> If so, does it have a memory leak?
>>>>>>>>
>>>>>>>> all the best,
>>>>>>>>
>>>>>>>> Sebastián
>>>>>>>> On Mar 11, 2016 11:14 AM, "Adam Plumb" <ap...@fiksu.com> wrote:
>>>>>>>>
>>>>>>>>> I've got a new cluster of 18 nodes running Cassandra 3.4 that I
>>>>>>>>> just launched and loaded data into yesterday (roughly 2TB of total storage)
>>>>>>>>> and am seeing runaway memory usage.  These nodes are EC2 c3.4xlarges with
>>>>>>>>> 30GB RAM and the heap size is set to 8G with a new heap size of 1.6G.
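>>>>>>>>>
>>>>>>>>> (For context, that corresponds to roughly these settings in conf/cassandra-env.sh --
>>>>>>>>> a sketch of the knobs involved, not the exact file:)
>>>>>>>>>
>>>>>>>>>   MAX_HEAP_SIZE="8G"      # total heap, -Xms/-Xmx
>>>>>>>>>   HEAP_NEWSIZE="1600M"    # young generation, -Xmn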
>>>>>>>>>
>>>>>>>>> Last night I finished loading up the data, then ran an incremental
>>>>>>>>> repair on one of the nodes just to ensure that everything was working
>>>>>>>>> (nodetool repair).  Over night all 18 nodes ran out of memory and were
>>>>>>>>> killed by the OOM killer.  I restarted them this morning and they all came
>>>>>>>>> up fine, but just started churning through memory and got killed again.  I
>>>>>>>>> restarted them again and they're doing the same thing.  I'm not getting any
>>>>>>>>> errors in the system log, since the process is getting killed abruptly
>>>>>>>>> (which makes me think this is a native memory issue, not heap)
>>>>>>>>>
>>>>>>>>> Obviously this behavior isn't the best.  I'm willing to provide
>>>>>>>>> any data people need to help debug this, these nodes are still up and
>>>>>>>>> running.  I'm also in IRC if anyone wants to jump on there.
>>>>>>>>>
>>>>>>>>> Here is the output of ps aux:
>>>>>>>>>
>>>>>>>>> 497       64351  108 89.5 187156072 27642988 ?  SLl  15:13  62:15
>>>>>>>>>> java -ea -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
>>>>>>>>>> -XX:ThreadPriorityPolicy=42 -Xms7536M -Xmx7536M -Xmn1600M
>>>>>>>>>> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
>>>>>>>>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>>>>>>>>> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>>>> -XX:+UseTLAB -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45
>>>>>>>>>> -XX:-ParallelRefProcEnabled -XX:-AlwaysPreTouch -XX:+UseBiasedLocking
>>>>>>>>>> -XX:+UseTLAB -XX:+ResizeTLAB -Djava.net.preferIPv4Stack=true
>>>>>>>>>> -Dcom.sun.management.jmxremote.port=7199
>>>>>>>>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>>>>>>>>> -Dcom.sun.management.jmxremote.ssl=false
>>>>>>>>>> -Dcom.sun.management.jmxremote.authenticate=false
>>>>>>>>>> -XX:+CMSClassUnloadingEnabled -Dlogback.configurationFile=logback.xml -D
>>>>>>>>>> cassandra.logdir=/usr/local/cassandra/logs -Dcas
>>>>>>>>>> sandra.storagedir=/usr/local/cassandra/data -Dcas
>>>>>>>>>> sandra-pidfile=/var/run/cassandra/cassandra.pid -cp
>>>>>>>>>> /usr/local/cassandra/conf:/usr/local/cas
>>>>>>>>>> sandra/build/classes/main:/usr/local/cas
>>>>>>>>>> sandra/build/classes/thrift:/usr/local/cas
>>>>>>>>>> sandra/lib/airline-0.6.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/antlr-runtime-3.5.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/apache-cassandra-3.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/apache-cassandra-clientutil-3.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/apache-cassandra-thrift-3.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/asm-5.0.4.jar:/usr/local/cassandra/lib/cas
>>>>>>>>>> sandra-driver-core-3.0.0-shaded.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/commons-cli-1.1.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/commons-codec-1.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/commons-lang3-3.1.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/commons-math3-3.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/compress-lzf-0.8.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/concurrent-trees-2.4.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/disruptor-3.0.1.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/ecj-4.4.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/guava-18.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/high-scale-lib-1.0.6.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/hppc-0.5.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jackson-core-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jamm-0.3.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/javax.inject.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jbcrypt-0.3m.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jflex-1.6.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/jna-4.0.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/joda-time-2.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/json-simple-1.1.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/libthrift-0.9.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/logback-classic-1.1.3.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/logback-core-1.1.3.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/lz4-1.3.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/metrics-core-3.1.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/metrics-logback-3.1.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/netty-all-4.0.23.Final.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/ohc-core-0.4.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/ohc-core-j8-0.4.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/primitive-1.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/reporter-config3-3.0.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/reporter-config-base-3.0.0.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/sigar-1.6.4.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/slf4j-api-1.7.7.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/snakeyaml-1.11.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/snappy-java-1.1.1.7.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/snowball-stemmer-1.3.0.581.1.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/ST4-4.0.8.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/stream-2.5.2.jar:/usr/local/cas
>>>>>>>>>> sandra/lib/thrift-server-0.3.7.jar:/usr/local/cassandra/lib/jsr223/*/*.jar
>>>>>>>>>> org.apache.cassandra.service.CassandraDaemon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Here is some dmesg output:
>>>>>>>>>
>>>>>>>>> [40003.010117] java invoked oom-killer: gfp_mask=0x280da, order=0,
>>>>>>>>> oom_score_adj=0
>>>>>>>>> [40003.013042] java cpuset=/ mems_allowed=0
>>>>>>>>> [40003.014789] CPU: 3 PID: 37757 Comm: java Tainted: G
>>>>>>>>> E   4.1.7-15.23.amzn1.x86_64 #1
>>>>>>>>> [40003.017852] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>>>>>>>> 12/07/2015
>>>>>>>>> [40003.020066]  0000000000000000 ffff8800ebaaba18 ffffffff814da12c
>>>>>>>>> 0000000000000000
>>>>>>>>> [40003.022870]  ffff880763594c80 ffff8800ebaabac8 ffffffff814d7939
>>>>>>>>> ffff8800ebaaba78
>>>>>>>>> [40003.025674]  ffffffff811bf8f7 ffff880770679c00 ffff88077001c190
>>>>>>>>> 0000000000000080
>>>>>>>>> [40003.028660] Call Trace:
>>>>>>>>> [40003.029613]  [<ffffffff814da12c>] dump_stack+0x45/0x57
>>>>>>>>> [40003.031486]  [<ffffffff814d7939>] dump_header+0x7f/0x1fe
>>>>>>>>> [40003.033390]  [<ffffffff811bf8f7>] ? mem_cgroup_iter+0x137/0x3d0
>>>>>>>>> [40003.035475]  [<ffffffff8107f496>] ? __queue_work+0x136/0x320
>>>>>>>>> [40003.037594]  [<ffffffff8115d11c>] oom_kill_process+0x1cc/0x3b0
>>>>>>>>> [40003.039825]  [<ffffffff8115d67e>] __out_of_memory+0x31e/0x530
>>>>>>>>> [40003.041938]  [<ffffffff8115da2b>] out_of_memory+0x5b/0x80
>>>>>>>>> [40003.043857]  [<ffffffff81162a79>]
>>>>>>>>> __alloc_pages_nodemask+0x8a9/0x8d0
>>>>>>>>> [40003.046105]  [<ffffffff811a48fa>]
>>>>>>>>> alloc_page_interleave+0x3a/0x90
>>>>>>>>> [40003.048419]  [<ffffffff811a79c3>] alloc_pages_vma+0x143/0x200
>>>>>>>>> [40003.050582]  [<ffffffff81188035>] handle_mm_fault+0x1355/0x1770
>>>>>>>>> [40003.052674]  [<ffffffff8118e4c5>] ? do_mmap_pgoff+0x2f5/0x3c0
>>>>>>>>> [40003.054737]  [<ffffffff8105dafc>] __do_page_fault+0x17c/0x420
>>>>>>>>> [40003.056858]  [<ffffffff8118c976>] ? SyS_mmap_pgoff+0x116/0x270
>>>>>>>>> [40003.059082]  [<ffffffff8105ddc2>] do_page_fault+0x22/0x30
>>>>>>>>> [40003.061084]  [<ffffffff814e2ad8>] page_fault+0x28/0x30
>>>>>>>>> [40003.062938] Mem-Info:
>>>>>>>>> [40003.063762] active_anon:5437903 inactive_anon:1025
>>>>>>>>> isolated_anon:0
>>>>>>>>>  active_file:51 inactive_file:8 isolated_file:0
>>>>>>>>>  unevictable:2088582 dirty:0 writeback:0 unstable:0
>>>>>>>>>  slab_reclaimable:82028 slab_unreclaimable:12209
>>>>>>>>>  mapped:31065 shmem:20 pagetables:37089 bounce:0
>>>>>>>>>  free:35830 free_pcp:3141 free_cma:0
>>>>>>>>> [40003.075549] Node 0 DMA free:15872kB min:8kB low:8kB high:12kB
>>>>>>>>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>>>>>>>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB
>>>>>>>>> managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
>>>>>>>>> slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:0kB
>>>>>>>>> pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
>>>>>>>>> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>>>>>>>>> [40003.090267] lowmem_reserve[]: 0 3746 30128 30128
>>>>>>>>> [40003.092182] Node 0 DMA32 free:108236kB min:2756kB low:3444kB
>>>>>>>>> high:4132kB active_anon:2400616kB inactive_anon:4060kB active_file:0kB
>>>>>>>>> inactive_file:0kB unevictable:1049732kB isolated(anon):0kB
>>>>>>>>> isolated(file):0kB present:3915776kB managed:3840296kB mlocked:1049732kB
>>>>>>>>> dirty:4kB writeback:0kB mapped:16564kB shmem:12kB slab_reclaimable:243852kB
>>>>>>>>> slab_unreclaimable:8832kB kernel_stack:1152kB pagetables:16532kB
>>>>>>>>> unstable:0kB bounce:0kB free_pcp:5716kB local_pcp:220kB free_cma:0kB
>>>>>>>>> writeback_tmp:0kB pages_scanned:5408 all_unreclaimable? yes
>>>>>>>>> [40003.108802] lowmem_reserve[]: 0 0 26382 26382
>>>>>>>>> [40003.110578] Node 0 Normal free:19212kB min:19412kB low:24264kB
>>>>>>>>> high:29116kB active_anon:19350996kB inactive_anon:40kB active_file:212kB
>>>>>>>>> inactive_file:80kB unevictable:7304596kB isolated(anon):0kB
>>>>>>>>> isolated(file):0kB present:27525120kB managed:27015196kB mlocked:7304596kB
>>>>>>>>> dirty:0kB writeback:0kB mapped:107696kB shmem:68kB slab_reclaimable:84228kB
>>>>>>>>> slab_unreclaimable:40004kB kernel_stack:10000kB pagetables:131824kB
>>>>>>>>> unstable:0kB bounce:0kB free_pcp:6848kB local_pcp:692kB free_cma:0kB
>>>>>>>>> writeback_tmp:0kB pages_scanned:38332 all_unreclaimable? yes
>>>>>>>>> [40003.128300] lowmem_reserve[]: 0 0 0 0
>>>>>>>>> [40003.129844] Node 0 DMA: 0*4kB 0*8kB 0*16kB 2*32kB (UE) 3*64kB
>>>>>>>>> (UE) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER)
>>>>>>>>> 2*4096kB (M) = 15872kB
>>>>>>>>> [40003.135917] Node 0 DMA32: 193*4kB (UEM) 254*8kB (UEM) 714*16kB
>>>>>>>>> (UE) 1344*32kB (UEMR) 249*64kB (UEMR) 120*128kB (UER) 53*256kB (ER)
>>>>>>>>> 10*512kB (ER) 1*1024kB (E) 0*2048kB 0*4096kB = 108244kB
>>>>>>>>> [40003.142956] Node 0 Normal: 3956*4kB (UE) 0*8kB 1*16kB (R)
>>>>>>>>> 8*32kB (R) 3*64kB (R) 2*128kB (R) 3*256kB (R) 0*512kB 0*1024kB 1*2048kB (R)
>>>>>>>>> 0*4096kB = 19360kB
>>>>>>>>> [40003.148749] Node 0 hugepages_total=0 hugepages_free=0
>>>>>>>>> hugepages_surp=0 hugepages_size=2048kB
>>>>>>>>> [40003.151777] 31304 total pagecache pages
>>>>>>>>> [40003.153288] 0 pages in swap cache
>>>>>>>>> [40003.154528] Swap cache stats: add 0, delete 0, find 0/0
>>>>>>>>> [40003.156377] Free swap  = 0kB
>>>>>>>>> [40003.157423] Total swap = 0kB
>>>>>>>>> [40003.158465] 7864221 pages RAM
>>>>>>>>> [40003.159522] 0 pages HighMem/MovableOnly
>>>>>>>>> [40003.160984] 146372 pages reserved
>>>>>>>>> [40003.162244] [ pid ]   uid  tgid total_vm      rss nr_ptes
>>>>>>>>> nr_pmds swapents oom_score_adj name
>>>>>>>>> [40003.165398] [ 2560]     0  2560     2804      181      11
>>>>>>>>> 3        0         -1000 udevd
>>>>>>>>> [40003.168638] [ 3976]     0  3976     2334      123       9
>>>>>>>>> 3        0             0 dhclient
>>>>>>>>> [40003.171895] [ 4017]     0  4017    11626       89      23
>>>>>>>>> 4        0         -1000 auditd
>>>>>>>>> [40003.175080] [ 4035]     0  4035    61861       99      23
>>>>>>>>> 3        0             0 rsyslogd
>>>>>>>>> [40003.178198] [ 4046]     0  4046     3462       98      10
>>>>>>>>> 3        0             0 irqbalance
>>>>>>>>> [40003.181559] [ 4052]     0  4052     1096       22       7
>>>>>>>>> 3        0             0 rngd
>>>>>>>>> [40003.184683] [ 4067]    32  4067     8815       99      22
>>>>>>>>> 3        0             0 rpcbind
>>>>>>>>> [40003.187772] [ 4084]    29  4084     9957      201      24
>>>>>>>>> 3        0             0 rpc.statd
>>>>>>>>> [40003.191099] [ 4115]    81  4115     5442       60      15
>>>>>>>>> 3        0             0 dbus-daemon
>>>>>>>>> [40003.194438] [ 4333]     0  4333    19452      522      40
>>>>>>>>> 3        0         -1000 sshd
>>>>>>>>> [40003.197432] [ 4361]    38  4361     7321      562      19
>>>>>>>>> 3        0             0 ntpd
>>>>>>>>> [40003.200609] [ 4376]     0  4376    22238      720      46
>>>>>>>>> 3        0             0 sendmail
>>>>>>>>> [40003.203868] [ 4384]    51  4384    20103      674      41
>>>>>>>>> 3        0             0 sendmail
>>>>>>>>> [40003.206963] [ 4515]     0  4515     4267       38      13
>>>>>>>>> 3        0             0 atd
>>>>>>>>> [40003.210100] [ 6730]     0  6730    29888      547      13
>>>>>>>>> 3        0             0 crond
>>>>>>>>> [40003.213267] [13533]   497 13533 47235415  7455314   36074
>>>>>>>>> 167        0             0 java
>>>>>>>>> [40003.216364] [13674]   498 13674    49154     3168      51
>>>>>>>>> 3        0             0 supervisord
>>>>>>>>> [40003.219721] [13680]   498 13680    51046     5350      69
>>>>>>>>> 3        0             0 python
>>>>>>>>> [40003.222908] [13682]   498 13682    36172     5602      75
>>>>>>>>> 3        0             0 python
>>>>>>>>> [40003.225952] [13683]   498 13683    32633     5319      68
>>>>>>>>> 3        0             0 python
>>>>>>>>> [40003.229108] [13684]   498 13684    29577     5003      63
>>>>>>>>> 3        0             0 python
>>>>>>>>> [40003.232263] [13719]   498 13719  1035920    41287     234
>>>>>>>>> 8        0             0 java
>>>>>>>>> [40003.235287] [13753]   498 13753    34605     5645      70
>>>>>>>>> 3        0             0 python
>>>>>>>>> [40003.238322] [14143]     0 14143     1615      420       9
>>>>>>>>> 3        0             0 agetty
>>>>>>>>> [40003.241582] [14145]     0 14145     1078      377       8
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.244752] [14147]     0 14147     1078      354       8
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.247833] [14149]     0 14149     1078      373       8
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.251193] [14151]     0 14151     1078      367       7
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.254342] [14153]     0 14153     1078      348       8
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.257443] [14154]     0 14154     2803      182      10
>>>>>>>>> 3        0         -1000 udevd
>>>>>>>>> [40003.260688] [14155]     0 14155     2803      182      10
>>>>>>>>> 3        0         -1000 udevd
>>>>>>>>> [40003.263782] [14157]     0 14157     1078      369       8
>>>>>>>>> 3        0             0 mingetty
>>>>>>>>> [40003.266895] Out of memory: Kill process 13533 (java) score 970
>>>>>>>>> or sacrifice child
>>>>>>>>> [40003.269702] Killed process 13533 (java) total-vm:188941660kB,
>>>>>>>>> anon-rss:29710828kB, file-rss:110428kB
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Cassandra causing OOM Killer to strike on new cluster running 3.4

Posted by Satoshi Hikida <sa...@imagine-orb.com>.
Hi, Paulo

Thank you for your help!

I'll try to use it.

Regards,
Satoshi

On Wed, Apr 20, 2016 at 8:43 PM, Paulo Motta <pa...@gmail.com>
wrote:

> 2.2.6 should be released in the next couple of weeks, but I also attached
> a patch file to the issue if you want to patch 2.2.5 manually.
>
> 2016-04-20 0:19 GMT-03:00 Satoshi Hikida <sa...@imagine-orb.com>:
>
>> Hi,
>>
>> I'm looking for a patch (file) for this bug (CASSANDRA-11344) that I can
>> apply to C* version 2.2.5. Is there a patch available for that version? I
>> checked the link (https://issues.apache.org/jira/browse/CASSANDRA-11344) but
>> couldn't find a patch file or anything similar. Or is there any workaround
>> to avoid this bug with version 2.2.5?
>>
>> Regards,
>> Satoshi
>>
>> On Tue, Mar 15, 2016 at 4:47 AM, Adam Plumb <ap...@fiksu.com> wrote:
>>
>>> OK so good news, I'm running with the patched jar file in my cluster and
>>> haven't seen any issues.  The bloom filter off-heap memory usage is between
>>> 1.5GB and 2GB per node, which is much more in-line with what I'm expecting!
>>>  (thumbsup)
>>>
>>> On Mon, Mar 14, 2016 at 9:42 AM, Adam Plumb <ap...@fiksu.com> wrote:
>>>
>>>> Thanks for the link!  Luckily the cluster I'm running is not yet in
>>>> production and running with dummy data so I will throw that jar on the
>>>> nodes and I'll let you know how things shake out.
>>>>
>>>> On Sun, Mar 13, 2016 at 11:02 PM, Paulo Motta <pauloricardomg@gmail.com
>>>> > wrote:
>>>>
>>>>> You could be hitting CASSANDRA-11344 (
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11344).  If that's
>>>>> the case, you may try to replace your cassandra jar on an affected node
>>>>> with a version with this fix in place and force bloom filter regeneration
>>>>> to see if it fixes your problem. You can build with "ant jar" from this
>>>>> branch: https://github.com/pauloricardomg/cassandra/tree/3.4-11344
>>>>>
>>>>> You can force bloom filter regeneration by either removing your
>>>>> *Filter.db files (make sure to back them up first for safety) or changing
>>>>> the bloom_filter_fp_chance before restarting affected nodes with the fixed
>>>>> jar.
>>>>>
>>>>> 2016-03-13 19:51 GMT-03:00 Adam Plumb <ap...@fiksu.com>:
>>>>>
>>>>>> So it's looking like the bloom filter off heap memory usage is
>>>>>> ramping up and up until the OOM killer kills the java process.  I
>>>>>> relaunched on instances with 60GB of memory and the same thing is
>>>>>> happening.  A node will start using more and more RAM until the process is
>>>>>> killed, then another node will start using more and more until it is also
>>>>>> killed.
>>>>>>
>>>>>> Is this the expected behavior?  It doesn't seem ideal to me.  Is
>>>>>> there anything obvious that I'm doing wrong?
>>>>>>
>>>>>> On Fri, Mar 11, 2016 at 11:31 AM, Adam Plumb <ap...@fiksu.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Here is the creation syntax for the entire schema.  The xyz table
>>>>>>> has about 2.1 billion keys and the def table has about 230 million keys.
>>>>>>> Max row size is about 3KB, mean row size is 700B.
>>>>>>>
>>>>>>> CREATE KEYSPACE abc WITH replication = {'class':
>>>>>>>> 'NetworkTopologyStrategy', 'us-east': 3};
>>>>>>>> CREATE TABLE xyz (
>>>>>>>>   id text,
>>>>>>>>   secondary_id int,
>>>>>>>>   data text,
>>>>>>>>   PRIMARY KEY(id)
>>>>>>>> )
>>>>>>>>   WITH
>>>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>>>> CREATE INDEX secondary_id_index ON abc.xyz (secondary_id);
>>>>>>>> CREATE TABLE def (
>>>>>>>>   id text,
>>>>>>>>   secondary_id int,
>>>>>>>>   data text,
>>>>>>>>   PRIMARY KEY(id)
>>>>>>>> )
>>>>>>>>   WITH
>>>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>>>> CREATE INDEX secondary_id_index_def ON abc.def (secondary_id);
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 11:24 AM, Jack Krupansky <
>>>>>>> jack.krupansky@gmail.com> wrote:
>>>>>>>
>>>>>>>> What is your schema and data like - in particular, how wide are
>>>>>>>> your partitions (number of rows and typical row size)?
>>>>>>>>
>>>>>>>> Maybe you just need (a lot) more heap for rows during the repair
>>>>>>>> process.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> On Fri, Mar 11, 2016 at 11:19 AM, Adam Plumb <ap...@fiksu.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> These are brand new boxes only running Cassandra.  Yeah the kernel
>>>>>>>>> is what is killing the JVM, and this does appear to be a memory leak in
>>>>>>>>> Cassandra.  And Cassandra is the only thing running, aside from the basic
>>>>>>>>> services needed for Amazon Linux to run.
>>>>>>>>>
>>>>>>>>> On Fri, Mar 11, 2016 at 11:17 AM, Sebastian Estevez <
>>>>>>>>> sebastian.estevez@datastax.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sacrifice child in dmesg is your OS killing the process with the
>>>>>>>>>> most ram. That means you're actually running out of memory at the Linux
>>>>>>>>>> level outside of the JVM.
>>>>>>>>>>
>>>>>>>>>> Are you running anything other than Cassandra on this box?
>>>>>>>>>>
>>>>>>>>>> If so, does it have a memory leak?
>>>>>>>>>>
>>>>>>>>>> all the best,
>>>>>>>>>>
>>>>>>>>>> Sebastián
>>>>>>>>>> On Mar 11, 2016 11:14 AM, "Adam Plumb" <ap...@fiksu.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've got a new cluster of 18 nodes running Cassandra 3.4 that I
>>>>>>>>>>> just launched and loaded data into yesterday (roughly 2TB of total storage)
>>>>>>>>>>> and am seeing runaway memory usage.  These nodes are EC2 c3.4xlarges with
>>>>>>>>>>> 30GB RAM and the heap size is set to 8G with a new heap size of 1.6G.
>>>>>>>>>>>
>>>>>>>>>>> Last night I finished loading up the data, then ran an
>>>>>>>>>>> incremental repair on one of the nodes just to ensure that everything was
>>>>>>>>>>> working (nodetool repair).  Over night all 18 nodes ran out of memory and
>>>>>>>>>>> were killed by the OOM killer.  I restarted them this morning and they all
>>>>>>>>>>> came up fine, but just started churning through memory and got killed
>>>>>>>>>>> again.  I restarted them again and they're doing the same thing.  I'm not
>>>>>>>>>>> getting any errors in the system log, since the process is getting killed
>>>>>>>>>>> abruptly (which makes me think this is a native memory issue, not heap)
>>>>>>>>>>>
>>>>>>>>>>> Obviously this behavior isn't the best.  I'm willing to provide
>>>>>>>>>>> any data people need to help debug this, these nodes are still up and
>>>>>>>>>>> running.  I'm also in IRC if anyone wants to jump on there.
>>>>>>>>>>>
>>>>>>>>>>> Here is the output of ps aux:
>>>>>>>>>>>
>>>>>>>>>>> 497       64351  108 89.5 187156072 27642988 ?  SLl  15:13
>>>>>>>>>>>> 62:15 java -ea -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
>>>>>>>>>>>> -XX:ThreadPriorityPolicy=42 -Xms7536M -Xmx7536M -Xmn1600M
>>>>>>>>>>>> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
>>>>>>>>>>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>>>>>>>>>>> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>>>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>>>>>> -XX:+UseTLAB -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45
>>>>>>>>>>>> -XX:-ParallelRefProcEnabled -XX:-AlwaysPreTouch -XX:+UseBiasedLocking
>>>>>>>>>>>> -XX:+UseTLAB -XX:+ResizeTLAB -Djava.net.preferIPv4Stack=true
>>>>>>>>>>>> -Dcom.sun.management.jmxremote.port=7199
>>>>>>>>>>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>>>>>>>>>>> -Dcom.sun.management.jmxremote.ssl=false
>>>>>>>>>>>> -Dcom.sun.management.jmxremote.authenticate=false
>>>>>>>>>>>> -XX:+CMSClassUnloadingEnabled -Dlogback.configurationFile=logback.xml -D
>>>>>>>>>>>> cassandra.logdir=/usr/local/cassandra/logs -Dcas
>>>>>>>>>>>> sandra.storagedir=/usr/local/cassandra/data -Dcas
>>>>>>>>>>>> sandra-pidfile=/var/run/cassandra/cassandra.pid -cp
>>>>>>>>>>>> /usr/local/cassandra/conf:/usr/local/cas
>>>>>>>>>>>> sandra/build/classes/main:/usr/local/cas
>>>>>>>>>>>> sandra/build/classes/thrift:/usr/local/cas
>>>>>>>>>>>> sandra/lib/airline-0.6.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/antlr-runtime-3.5.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/apache-cassandra-3.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/apache-cassandra-clientutil-3.4.jar:/usr/local/
>>>>>>>>>>>> cassandra/lib/apache-cassandra-thrift-3.4.jar:/usr/local/
>>>>>>>>>>>> cassandra/lib/asm-5.0.4.jar:/usr/local/cassandra/lib/cas
>>>>>>>>>>>> sandra-driver-core-3.0.0-shaded.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/commons-cli-1.1.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/commons-codec-1.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/commons-lang3-3.1.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/commons-math3-3.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/compress-lzf-0.8.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/concurrent-trees-2.4.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/disruptor-3.0.1.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/ecj-4.4.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/guava-18.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/high-scale-lib-1.0.6.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/hppc-0.5.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jackson-core-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jamm-0.3.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/javax.inject.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jbcrypt-0.3m.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jflex-1.6.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/jna-4.0.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/joda-time-2.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/json-simple-1.1.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/libthrift-0.9.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/logback-classic-1.1.3.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/logback-core-1.1.3.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/lz4-1.3.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/metrics-core-3.1.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/metrics-logback-3.1.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/netty-all-4.0.23.Final.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/ohc-core-0.4.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/ohc-core-j8-0.4.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/primitive-1.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/reporter-config3-3.0.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/reporter-config-base-3.0.0.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/sigar-1.6.4.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/slf4j-api-1.7.7.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/snakeyaml-1.11.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/snappy-java-1.1.1.7.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/snowball-stemmer-1.3.0.581.1.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/ST4-4.0.8.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/stream-2.5.2.jar:/usr/local/cas
>>>>>>>>>>>> sandra/lib/thrift-server-0.3.7.jar:/usr/local/cassandra/lib/jsr223/*/*.jar
>>>>>>>>>>>> org.apache.cassandra.service.CassandraDaemon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  Here is some dmesg output:
>>>>>>>>>>>
>>>>>>>>>>> [40003.010117] java invoked oom-killer: gfp_mask=0x280da,
>>>>>>>>>>> order=0, oom_score_adj=0
>>>>>>>>>>> [40003.013042] java cpuset=/ mems_allowed=0
>>>>>>>>>>> [40003.014789] CPU: 3 PID: 37757 Comm: java Tainted: G
>>>>>>>>>>>   E   4.1.7-15.23.amzn1.x86_64 #1
>>>>>>>>>>> [40003.017852] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>>>>>>>>>> 12/07/2015
>>>>>>>>>>> [40003.020066]  0000000000000000 ffff8800ebaaba18
>>>>>>>>>>> ffffffff814da12c 0000000000000000
>>>>>>>>>>> [40003.022870]  ffff880763594c80 ffff8800ebaabac8
>>>>>>>>>>> ffffffff814d7939 ffff8800ebaaba78
>>>>>>>>>>> [40003.025674]  ffffffff811bf8f7 ffff880770679c00
>>>>>>>>>>> ffff88077001c190 0000000000000080
>>>>>>>>>>> [40003.028660] Call Trace:
>>>>>>>>>>> [40003.029613]  [<ffffffff814da12c>] dump_stack+0x45/0x57
>>>>>>>>>>> [40003.031486]  [<ffffffff814d7939>] dump_header+0x7f/0x1fe
>>>>>>>>>>> [40003.033390]  [<ffffffff811bf8f7>] ?
>>>>>>>>>>> mem_cgroup_iter+0x137/0x3d0
>>>>>>>>>>> [40003.035475]  [<ffffffff8107f496>] ? __queue_work+0x136/0x320
>>>>>>>>>>> [40003.037594]  [<ffffffff8115d11c>] oom_kill_process+0x1cc/0x3b0
>>>>>>>>>>> [40003.039825]  [<ffffffff8115d67e>] __out_of_memory+0x31e/0x530
>>>>>>>>>>> [40003.041938]  [<ffffffff8115da2b>] out_of_memory+0x5b/0x80
>>>>>>>>>>> [40003.043857]  [<ffffffff81162a79>]
>>>>>>>>>>> __alloc_pages_nodemask+0x8a9/0x8d0
>>>>>>>>>>> [40003.046105]  [<ffffffff811a48fa>]
>>>>>>>>>>> alloc_page_interleave+0x3a/0x90
>>>>>>>>>>> [40003.048419]  [<ffffffff811a79c3>] alloc_pages_vma+0x143/0x200
>>>>>>>>>>> [40003.050582]  [<ffffffff81188035>]
>>>>>>>>>>> handle_mm_fault+0x1355/0x1770
>>>>>>>>>>> [40003.052674]  [<ffffffff8118e4c5>] ? do_mmap_pgoff+0x2f5/0x3c0
>>>>>>>>>>> [40003.054737]  [<ffffffff8105dafc>] __do_page_fault+0x17c/0x420
>>>>>>>>>>> [40003.056858]  [<ffffffff8118c976>] ? SyS_mmap_pgoff+0x116/0x270
>>>>>>>>>>> [40003.059082]  [<ffffffff8105ddc2>] do_page_fault+0x22/0x30
>>>>>>>>>>> [40003.061084]  [<ffffffff814e2ad8>] page_fault+0x28/0x30
>>>>>>>>>>> [40003.062938] Mem-Info:
>>>>>>>>>>> [40003.063762] active_anon:5437903 inactive_anon:1025
>>>>>>>>>>> isolated_anon:0
>>>>>>>>>>>  active_file:51 inactive_file:8 isolated_file:0
>>>>>>>>>>>  unevictable:2088582 dirty:0 writeback:0 unstable:0
>>>>>>>>>>>  slab_reclaimable:82028 slab_unreclaimable:12209
>>>>>>>>>>>  mapped:31065 shmem:20 pagetables:37089 bounce:0
>>>>>>>>>>>  free:35830 free_pcp:3141 free_cma:0
>>>>>>>>>>> [40003.075549] Node 0 DMA free:15872kB min:8kB low:8kB high:12kB
>>>>>>>>>>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>>>>>>>>>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB
>>>>>>>>>>> managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
>>>>>>>>>>> slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:0kB
>>>>>>>>>>> pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
>>>>>>>>>>> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>>>>>>>>>>> [40003.090267] lowmem_reserve[]: 0 3746 30128 30128
>>>>>>>>>>> [40003.092182] Node 0 DMA32 free:108236kB min:2756kB low:3444kB
>>>>>>>>>>> high:4132kB active_anon:2400616kB inactive_anon:4060kB active_file:0kB
>>>>>>>>>>> inactive_file:0kB unevictable:1049732kB isolated(anon):0kB
>>>>>>>>>>> isolated(file):0kB present:3915776kB managed:3840296kB mlocked:1049732kB
>>>>>>>>>>> dirty:4kB writeback:0kB mapped:16564kB shmem:12kB slab_reclaimable:243852kB
>>>>>>>>>>> slab_unreclaimable:8832kB kernel_stack:1152kB pagetables:16532kB
>>>>>>>>>>> unstable:0kB bounce:0kB free_pcp:5716kB local_pcp:220kB free_cma:0kB
>>>>>>>>>>> writeback_tmp:0kB pages_scanned:5408 all_unreclaimable? yes
>>>>>>>>>>> [40003.108802] lowmem_reserve[]: 0 0 26382 26382
>>>>>>>>>>> [40003.110578] Node 0 Normal free:19212kB min:19412kB
>>>>>>>>>>> low:24264kB high:29116kB active_anon:19350996kB inactive_anon:40kB
>>>>>>>>>>> active_file:212kB inactive_file:80kB unevictable:7304596kB
>>>>>>>>>>> isolated(anon):0kB isolated(file):0kB present:27525120kB managed:27015196kB
>>>>>>>>>>> mlocked:7304596kB dirty:0kB writeback:0kB mapped:107696kB shmem:68kB
>>>>>>>>>>> slab_reclaimable:84228kB slab_unreclaimable:40004kB kernel_stack:10000kB
>>>>>>>>>>> pagetables:131824kB unstable:0kB bounce:0kB free_pcp:6848kB local_pcp:692kB
>>>>>>>>>>> free_cma:0kB writeback_tmp:0kB pages_scanned:38332 all_unreclaimable? yes
>>>>>>>>>>> [40003.128300] lowmem_reserve[]: 0 0 0 0
>>>>>>>>>>> [40003.129844] Node 0 DMA: 0*4kB 0*8kB 0*16kB 2*32kB (UE) 3*64kB
>>>>>>>>>>> (UE) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER)
>>>>>>>>>>> 2*4096kB (M) = 15872kB
>>>>>>>>>>> [40003.135917] Node 0 DMA32: 193*4kB (UEM) 254*8kB (UEM)
>>>>>>>>>>> 714*16kB (UE) 1344*32kB (UEMR) 249*64kB (UEMR) 120*128kB (UER) 53*256kB
>>>>>>>>>>> (ER) 10*512kB (ER) 1*1024kB (E) 0*2048kB 0*4096kB = 108244kB
>>>>>>>>>>> [40003.142956] Node 0 Normal: 3956*4kB (UE) 0*8kB 1*16kB (R)
>>>>>>>>>>> 8*32kB (R) 3*64kB (R) 2*128kB (R) 3*256kB (R) 0*512kB 0*1024kB 1*2048kB (R)
>>>>>>>>>>> 0*4096kB = 19360kB
>>>>>>>>>>> [40003.148749] Node 0 hugepages_total=0 hugepages_free=0
>>>>>>>>>>> hugepages_surp=0 hugepages_size=2048kB
>>>>>>>>>>> [40003.151777] 31304 total pagecache pages
>>>>>>>>>>> [40003.153288] 0 pages in swap cache
>>>>>>>>>>> [40003.154528] Swap cache stats: add 0, delete 0, find 0/0
>>>>>>>>>>> [40003.156377] Free swap  = 0kB
>>>>>>>>>>> [40003.157423] Total swap = 0kB
>>>>>>>>>>> [40003.158465] 7864221 pages RAM
>>>>>>>>>>> [40003.159522] 0 pages HighMem/MovableOnly
>>>>>>>>>>> [40003.160984] 146372 pages reserved
>>>>>>>>>>> [40003.162244] [ pid ]   uid  tgid total_vm      rss nr_ptes
>>>>>>>>>>> nr_pmds swapents oom_score_adj name
>>>>>>>>>>> [40003.165398] [ 2560]     0  2560     2804      181      11
>>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>>> [40003.168638] [ 3976]     0  3976     2334      123       9
>>>>>>>>>>>   3        0             0 dhclient
>>>>>>>>>>> [40003.171895] [ 4017]     0  4017    11626       89      23
>>>>>>>>>>>   4        0         -1000 auditd
>>>>>>>>>>> [40003.175080] [ 4035]     0  4035    61861       99      23
>>>>>>>>>>>   3        0             0 rsyslogd
>>>>>>>>>>> [40003.178198] [ 4046]     0  4046     3462       98      10
>>>>>>>>>>>   3        0             0 irqbalance
>>>>>>>>>>> [40003.181559] [ 4052]     0  4052     1096       22       7
>>>>>>>>>>>   3        0             0 rngd
>>>>>>>>>>> [40003.184683] [ 4067]    32  4067     8815       99      22
>>>>>>>>>>>   3        0             0 rpcbind
>>>>>>>>>>> [40003.187772] [ 4084]    29  4084     9957      201      24
>>>>>>>>>>>   3        0             0 rpc.statd
>>>>>>>>>>> [40003.191099] [ 4115]    81  4115     5442       60      15
>>>>>>>>>>>   3        0             0 dbus-daemon
>>>>>>>>>>> [40003.194438] [ 4333]     0  4333    19452      522      40
>>>>>>>>>>>   3        0         -1000 sshd
>>>>>>>>>>> [40003.197432] [ 4361]    38  4361     7321      562      19
>>>>>>>>>>>   3        0             0 ntpd
>>>>>>>>>>> [40003.200609] [ 4376]     0  4376    22238      720      46
>>>>>>>>>>>   3        0             0 sendmail
>>>>>>>>>>> [40003.203868] [ 4384]    51  4384    20103      674      41
>>>>>>>>>>>   3        0             0 sendmail
>>>>>>>>>>> [40003.206963] [ 4515]     0  4515     4267       38      13
>>>>>>>>>>>   3        0             0 atd
>>>>>>>>>>> [40003.210100] [ 6730]     0  6730    29888      547      13
>>>>>>>>>>>   3        0             0 crond
>>>>>>>>>>> [40003.213267] [13533]   497 13533 47235415  7455314   36074
>>>>>>>>>>> 167        0             0 java
>>>>>>>>>>> [40003.216364] [13674]   498 13674    49154     3168      51
>>>>>>>>>>>   3        0             0 supervisord
>>>>>>>>>>> [40003.219721] [13680]   498 13680    51046     5350      69
>>>>>>>>>>>   3        0             0 python
>>>>>>>>>>> [40003.222908] [13682]   498 13682    36172     5602      75
>>>>>>>>>>>   3        0             0 python
>>>>>>>>>>> [40003.225952] [13683]   498 13683    32633     5319      68
>>>>>>>>>>>   3        0             0 python
>>>>>>>>>>> [40003.229108] [13684]   498 13684    29577     5003      63
>>>>>>>>>>>   3        0             0 python
>>>>>>>>>>> [40003.232263] [13719]   498 13719  1035920    41287     234
>>>>>>>>>>>   8        0             0 java
>>>>>>>>>>> [40003.235287] [13753]   498 13753    34605     5645      70
>>>>>>>>>>>   3        0             0 python
>>>>>>>>>>> [40003.238322] [14143]     0 14143     1615      420       9
>>>>>>>>>>>   3        0             0 agetty
>>>>>>>>>>> [40003.241582] [14145]     0 14145     1078      377       8
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.244752] [14147]     0 14147     1078      354       8
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.247833] [14149]     0 14149     1078      373       8
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.251193] [14151]     0 14151     1078      367       7
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.254342] [14153]     0 14153     1078      348       8
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.257443] [14154]     0 14154     2803      182      10
>>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>>> [40003.260688] [14155]     0 14155     2803      182      10
>>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>>> [40003.263782] [14157]     0 14157     1078      369       8
>>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>>> [40003.266895] Out of memory: Kill process 13533 (java) score
>>>>>>>>>>> 970 or sacrifice child
>>>>>>>>>>> [40003.269702] Killed process 13533 (java) total-vm:188941660kB,
>>>>>>>>>>> anon-rss:29710828kB, file-rss:110428kB
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Cassandra causing OOM Killer to strike on new cluster running 3.4

Posted by Paulo Motta <pa...@gmail.com>.
2.2.6 should be released in the next couple of weeks, but I also attached a
patch file to the issue if you want to patch 2.2.5 manually.
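
If you want to go the manual route, it is roughly (a sketch, assuming you build
from an apache-cassandra-2.2.5 source tree and the patch applies cleanly; the
patch filename below is a placeholder for the actual attachment on the issue):

  cd apache-cassandra-2.2.5-src
  patch -p1 < CASSANDRA-11344-2.2.patch   # or "git apply", depending on the patch format
  ant jar
  # then swap the rebuilt jar into lib/ on each node and restart it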

2016-04-20 0:19 GMT-03:00 Satoshi Hikida <sa...@imagine-orb.com>:

> Hi,
>
> I'm looking for a patch (file) for this bug (CASSANDRA-11344) that I can
> apply to C* version 2.2.5. Is there a patch available for that version? I
> checked the link (https://issues.apache.org/jira/browse/CASSANDRA-11344) but
> couldn't find a patch file or anything similar. Or is there any workaround
> to avoid this bug with version 2.2.5?
>
> Regards,
> Satoshi
>
> On Tue, Mar 15, 2016 at 4:47 AM, Adam Plumb <ap...@fiksu.com> wrote:
>
>> OK so good news, I'm running with the patched jar file in my cluster and
>> haven't seen any issues.  The bloom filter off-heap memory usage is between
>> 1.5GB and 2GB per node, which is much more in-line with what I'm expecting!
>>  (thumbsup)
>>
>> On Mon, Mar 14, 2016 at 9:42 AM, Adam Plumb <ap...@fiksu.com> wrote:
>>
>>> Thanks for the link!  Luckily the cluster I'm running is not yet in
>>> production and running with dummy data so I will throw that jar on the
>>> nodes and I'll let you know how things shake out.
>>>
>>> On Sun, Mar 13, 2016 at 11:02 PM, Paulo Motta <pa...@gmail.com>
>>> wrote:
>>>
>>>> You could be hitting CASSANDRA-11344 (
>>>> https://issues.apache.org/jira/browse/CASSANDRA-11344).  If that's the
>>>> case, you may try to replace your cassandra jar on an affected node with a
>>>> version with this fix in place and force bloom filter regeneration to see
>>>> if it fixes your problem. You can build with "ant jar" from this branch:
>>>> https://github.com/pauloricardomg/cassandra/tree/3.4-11344
>>>>
>>>> You can force bloom filter regeneration by either removing your
>>>> *Filter.db files (make sure to back them up first for safety) or changing
>>>> the bloom_filter_fp_chance before restarting affected nodes with the fixed
>>>> jar.
>>>>
>>>> 2016-03-13 19:51 GMT-03:00 Adam Plumb <ap...@fiksu.com>:
>>>>
>>>>> So it's looking like the bloom filter off heap memory usage is ramping
>>>>> up and up until the OOM killer kills the java process.  I relaunched on
>>>>> instances with 60GB of memory and the same thing is happening.  A node will
>>>>> start using more and more RAM until the process is killed, then another
>>>>> node will start using more and more until it is also killed.
>>>>>
>>>>> Is this the expected behavior?  It doesn't seem ideal to me.  Is there
>>>>> anything obvious that I'm doing wrong?
>>>>>
>>>>> On Fri, Mar 11, 2016 at 11:31 AM, Adam Plumb <ap...@fiksu.com> wrote:
>>>>>
>>>>>> Here is the creation syntax for the entire schema.  The xyz table has
>>>>>> about 2.1 billion keys and the def table has about 230 million keys.  Max
>>>>>> row size is about 3KB, mean row size is 700B.
>>>>>>
>>>>>> CREATE KEYSPACE abc WITH replication = {'class':
>>>>>>> 'NetworkTopologyStrategy', 'us-east': 3};
>>>>>>> CREATE TABLE xyz (
>>>>>>>   id text,
>>>>>>>   secondary_id int,
>>>>>>>   data text,
>>>>>>>   PRIMARY KEY(id)
>>>>>>> )
>>>>>>>   WITH
>>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>>> CREATE INDEX secondary_id_index ON abc.xyz (secondary_id);
>>>>>>> CREATE TABLE def (
>>>>>>>   id text,
>>>>>>>   secondary_id int,
>>>>>>>   data text,
>>>>>>>   PRIMARY KEY(id)
>>>>>>> )
>>>>>>>   WITH
>>>>>>>   compaction = { 'class': 'LeveledCompactionStrategy' }
>>>>>>>   and compression = {'class': 'LZ4Compressor'};
>>>>>>> CREATE INDEX secondary_id_index_def ON abc.def (secondary_id);
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 11, 2016 at 11:24 AM, Jack Krupansky <
>>>>>> jack.krupansky@gmail.com> wrote:
>>>>>>
>>>>>>> What is your schema and data like - in particular, how wide are your
>>>>>>> partitions (number of rows and typical row size)?
>>>>>>>
>>>>>>> Maybe you just need (a lot) more heap for rows during the repair
>>>>>>> process.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 11:19 AM, Adam Plumb <ap...@fiksu.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> These are brand new boxes only running Cassandra.  Yeah the kernel
>>>>>>>> is what is killing the JVM, and this does appear to be a memory leak in
>>>>>>>> Cassandra.  And Cassandra is the only thing running, aside from the basic
>>>>>>>> services needed for Amazon Linux to run.
>>>>>>>>
>>>>>>>> On Fri, Mar 11, 2016 at 11:17 AM, Sebastian Estevez <
>>>>>>>> sebastian.estevez@datastax.com> wrote:
>>>>>>>>
>>>>>>>>> Sacrifice child in dmesg is your OS killing the process with the
>>>>>>>>> most ram. That means you're actually running out of memory at the Linux
>>>>>>>>> level outside of the JVM.
>>>>>>>>>
>>>>>>>>> Are you running anything other than Cassandra on this box?
>>>>>>>>>
>>>>>>>>> If so, does it have a memory leak?
>>>>>>>>>
>>>>>>>>> all the best,
>>>>>>>>>
>>>>>>>>> Sebastián
>>>>>>>>> On Mar 11, 2016 11:14 AM, "Adam Plumb" <ap...@fiksu.com> wrote:
>>>>>>>>>
>>>>>>>>>> I've got a new cluster of 18 nodes running Cassandra 3.4 that I
>>>>>>>>>> just launched and loaded data into yesterday (roughly 2TB of total storage)
>>>>>>>>>> and am seeing runaway memory usage.  These nodes are EC2 c3.4xlarges with
>>>>>>>>>> 30GB RAM and the heap size is set to 8G with a new heap size of 1.6G.
>>>>>>>>>>
>>>>>>>>>> Last night I finished loading up the data, then ran an
>>>>>>>>>> incremental repair on one of the nodes just to ensure that everything was
>>>>>>>>>> working (nodetool repair).  Over night all 18 nodes ran out of memory and
>>>>>>>>>> were killed by the OOM killer.  I restarted them this morning and they all
>>>>>>>>>> came up fine, but just started churning through memory and got killed
>>>>>>>>>> again.  I restarted them again and they're doing the same thing.  I'm not
>>>>>>>>>> getting any errors in the system log, since the process is getting killed
>>>>>>>>>> abruptly (which makes me think this is a native memory issue, not heap)
>>>>>>>>>>
>>>>>>>>>> Obviously this behavior isn't the best.  I'm willing to provide
>>>>>>>>>> any data people need to help debug this, these nodes are still up and
>>>>>>>>>> running.  I'm also in IRC if anyone wants to jump on there.
>>>>>>>>>>
>>>>>>>>>> Here is the output of ps aux:
>>>>>>>>>>
>>>>>>>>>> 497       64351  108 89.5 187156072 27642988 ?  SLl  15:13  62:15
>>>>>>>>>>> java -ea -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
>>>>>>>>>>> -XX:ThreadPriorityPolicy=42 -Xms7536M -Xmx7536M -Xmn1600M
>>>>>>>>>>> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
>>>>>>>>>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>>>>>>>>>> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>>>>> -XX:+UseTLAB -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45
>>>>>>>>>>> -XX:-ParallelRefProcEnabled -XX:-AlwaysPreTouch -XX:+UseBiasedLocking
>>>>>>>>>>> -XX:+UseTLAB -XX:+ResizeTLAB -Djava.net.preferIPv4Stack=true
>>>>>>>>>>> -Dcom.sun.management.jmxremote.port=7199
>>>>>>>>>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>>>>>>>>>> -Dcom.sun.management.jmxremote.ssl=false
>>>>>>>>>>> -Dcom.sun.management.jmxremote.authenticate=false
>>>>>>>>>>> -XX:+CMSClassUnloadingEnabled -Dlogback.configurationFile=logback.xml -D
>>>>>>>>>>> cassandra.logdir=/usr/local/cassandra/logs -Dcas
>>>>>>>>>>> sandra.storagedir=/usr/local/cassandra/data -Dcas
>>>>>>>>>>> sandra-pidfile=/var/run/cassandra/cassandra.pid -cp
>>>>>>>>>>> /usr/local/cassandra/conf:/usr/local/cas
>>>>>>>>>>> sandra/build/classes/main:/usr/local/cas
>>>>>>>>>>> sandra/build/classes/thrift:/usr/local/cas
>>>>>>>>>>> sandra/lib/airline-0.6.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/antlr-runtime-3.5.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/apache-cassandra-3.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/apache-cassandra-clientutil-3.4.jar:/usr/local/
>>>>>>>>>>> cassandra/lib/apache-cassandra-thrift-3.4.jar:/usr/local/
>>>>>>>>>>> cassandra/lib/asm-5.0.4.jar:/usr/local/cassandra/lib/cas
>>>>>>>>>>> sandra-driver-core-3.0.0-shaded.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/commons-cli-1.1.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/commons-codec-1.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/commons-lang3-3.1.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/commons-math3-3.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/compress-lzf-0.8.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/concurrent-trees-2.4.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/disruptor-3.0.1.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/ecj-4.4.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/guava-18.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/high-scale-lib-1.0.6.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/hppc-0.5.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jackson-core-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jamm-0.3.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/javax.inject.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jbcrypt-0.3m.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jflex-1.6.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/jna-4.0.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/joda-time-2.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/json-simple-1.1.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/libthrift-0.9.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/logback-classic-1.1.3.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/logback-core-1.1.3.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/lz4-1.3.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/metrics-core-3.1.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/metrics-logback-3.1.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/netty-all-4.0.23.Final.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/ohc-core-0.4.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/ohc-core-j8-0.4.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/primitive-1.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/reporter-config3-3.0.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/reporter-config-base-3.0.0.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/sigar-1.6.4.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/slf4j-api-1.7.7.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/snakeyaml-1.11.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/snappy-java-1.1.1.7.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/snowball-stemmer-1.3.0.581.1.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/ST4-4.0.8.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/stream-2.5.2.jar:/usr/local/cas
>>>>>>>>>>> sandra/lib/thrift-server-0.3.7.jar:/usr/local/cassandra/lib/jsr223/*/*.jar
>>>>>>>>>>> org.apache.cassandra.service.CassandraDaemon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Here is some dmesg output:
>>>>>>>>>>
>>>>>>>>>> [40003.010117] java invoked oom-killer: gfp_mask=0x280da,
>>>>>>>>>> order=0, oom_score_adj=0
>>>>>>>>>> [40003.013042] java cpuset=/ mems_allowed=0
>>>>>>>>>> [40003.014789] CPU: 3 PID: 37757 Comm: java Tainted: G
>>>>>>>>>> E   4.1.7-15.23.amzn1.x86_64 #1
>>>>>>>>>> [40003.017852] Hardware name: Xen HVM domU, BIOS 4.2.amazon
>>>>>>>>>> 12/07/2015
>>>>>>>>>> [40003.020066]  0000000000000000 ffff8800ebaaba18
>>>>>>>>>> ffffffff814da12c 0000000000000000
>>>>>>>>>> [40003.022870]  ffff880763594c80 ffff8800ebaabac8
>>>>>>>>>> ffffffff814d7939 ffff8800ebaaba78
>>>>>>>>>> [40003.025674]  ffffffff811bf8f7 ffff880770679c00
>>>>>>>>>> ffff88077001c190 0000000000000080
>>>>>>>>>> [40003.028660] Call Trace:
>>>>>>>>>> [40003.029613]  [<ffffffff814da12c>] dump_stack+0x45/0x57
>>>>>>>>>> [40003.031486]  [<ffffffff814d7939>] dump_header+0x7f/0x1fe
>>>>>>>>>> [40003.033390]  [<ffffffff811bf8f7>] ? mem_cgroup_iter+0x137/0x3d0
>>>>>>>>>> [40003.035475]  [<ffffffff8107f496>] ? __queue_work+0x136/0x320
>>>>>>>>>> [40003.037594]  [<ffffffff8115d11c>] oom_kill_process+0x1cc/0x3b0
>>>>>>>>>> [40003.039825]  [<ffffffff8115d67e>] __out_of_memory+0x31e/0x530
>>>>>>>>>> [40003.041938]  [<ffffffff8115da2b>] out_of_memory+0x5b/0x80
>>>>>>>>>> [40003.043857]  [<ffffffff81162a79>]
>>>>>>>>>> __alloc_pages_nodemask+0x8a9/0x8d0
>>>>>>>>>> [40003.046105]  [<ffffffff811a48fa>]
>>>>>>>>>> alloc_page_interleave+0x3a/0x90
>>>>>>>>>> [40003.048419]  [<ffffffff811a79c3>] alloc_pages_vma+0x143/0x200
>>>>>>>>>> [40003.050582]  [<ffffffff81188035>] handle_mm_fault+0x1355/0x1770
>>>>>>>>>> [40003.052674]  [<ffffffff8118e4c5>] ? do_mmap_pgoff+0x2f5/0x3c0
>>>>>>>>>> [40003.054737]  [<ffffffff8105dafc>] __do_page_fault+0x17c/0x420
>>>>>>>>>> [40003.056858]  [<ffffffff8118c976>] ? SyS_mmap_pgoff+0x116/0x270
>>>>>>>>>> [40003.059082]  [<ffffffff8105ddc2>] do_page_fault+0x22/0x30
>>>>>>>>>> [40003.061084]  [<ffffffff814e2ad8>] page_fault+0x28/0x30
>>>>>>>>>> [40003.062938] Mem-Info:
>>>>>>>>>> [40003.063762] active_anon:5437903 inactive_anon:1025
>>>>>>>>>> isolated_anon:0
>>>>>>>>>>  active_file:51 inactive_file:8 isolated_file:0
>>>>>>>>>>  unevictable:2088582 dirty:0 writeback:0 unstable:0
>>>>>>>>>>  slab_reclaimable:82028 slab_unreclaimable:12209
>>>>>>>>>>  mapped:31065 shmem:20 pagetables:37089 bounce:0
>>>>>>>>>>  free:35830 free_pcp:3141 free_cma:0
>>>>>>>>>> [40003.075549] Node 0 DMA free:15872kB min:8kB low:8kB high:12kB
>>>>>>>>>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>>>>>>>>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB
>>>>>>>>>> managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
>>>>>>>>>> slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:0kB
>>>>>>>>>> pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
>>>>>>>>>> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>>>>>>>>>> [40003.090267] lowmem_reserve[]: 0 3746 30128 30128
>>>>>>>>>> [40003.092182] Node 0 DMA32 free:108236kB min:2756kB low:3444kB
>>>>>>>>>> high:4132kB active_anon:2400616kB inactive_anon:4060kB active_file:0kB
>>>>>>>>>> inactive_file:0kB unevictable:1049732kB isolated(anon):0kB
>>>>>>>>>> isolated(file):0kB present:3915776kB managed:3840296kB mlocked:1049732kB
>>>>>>>>>> dirty:4kB writeback:0kB mapped:16564kB shmem:12kB slab_reclaimable:243852kB
>>>>>>>>>> slab_unreclaimable:8832kB kernel_stack:1152kB pagetables:16532kB
>>>>>>>>>> unstable:0kB bounce:0kB free_pcp:5716kB local_pcp:220kB free_cma:0kB
>>>>>>>>>> writeback_tmp:0kB pages_scanned:5408 all_unreclaimable? yes
>>>>>>>>>> [40003.108802] lowmem_reserve[]: 0 0 26382 26382
>>>>>>>>>> [40003.110578] Node 0 Normal free:19212kB min:19412kB low:24264kB
>>>>>>>>>> high:29116kB active_anon:19350996kB inactive_anon:40kB active_file:212kB
>>>>>>>>>> inactive_file:80kB unevictable:7304596kB isolated(anon):0kB
>>>>>>>>>> isolated(file):0kB present:27525120kB managed:27015196kB mlocked:7304596kB
>>>>>>>>>> dirty:0kB writeback:0kB mapped:107696kB shmem:68kB slab_reclaimable:84228kB
>>>>>>>>>> slab_unreclaimable:40004kB kernel_stack:10000kB pagetables:131824kB
>>>>>>>>>> unstable:0kB bounce:0kB free_pcp:6848kB local_pcp:692kB free_cma:0kB
>>>>>>>>>> writeback_tmp:0kB pages_scanned:38332 all_unreclaimable? yes
>>>>>>>>>> [40003.128300] lowmem_reserve[]: 0 0 0 0
>>>>>>>>>> [40003.129844] Node 0 DMA: 0*4kB 0*8kB 0*16kB 2*32kB (UE) 3*64kB
>>>>>>>>>> (UE) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER)
>>>>>>>>>> 2*4096kB (M) = 15872kB
>>>>>>>>>> [40003.135917] Node 0 DMA32: 193*4kB (UEM) 254*8kB (UEM) 714*16kB
>>>>>>>>>> (UE) 1344*32kB (UEMR) 249*64kB (UEMR) 120*128kB (UER) 53*256kB (ER)
>>>>>>>>>> 10*512kB (ER) 1*1024kB (E) 0*2048kB 0*4096kB = 108244kB
>>>>>>>>>> [40003.142956] Node 0 Normal: 3956*4kB (UE) 0*8kB 1*16kB (R)
>>>>>>>>>> 8*32kB (R) 3*64kB (R) 2*128kB (R) 3*256kB (R) 0*512kB 0*1024kB 1*2048kB (R)
>>>>>>>>>> 0*4096kB = 19360kB
>>>>>>>>>> [40003.148749] Node 0 hugepages_total=0 hugepages_free=0
>>>>>>>>>> hugepages_surp=0 hugepages_size=2048kB
>>>>>>>>>> [40003.151777] 31304 total pagecache pages
>>>>>>>>>> [40003.153288] 0 pages in swap cache
>>>>>>>>>> [40003.154528] Swap cache stats: add 0, delete 0, find 0/0
>>>>>>>>>> [40003.156377] Free swap  = 0kB
>>>>>>>>>> [40003.157423] Total swap = 0kB
>>>>>>>>>> [40003.158465] 7864221 pages RAM
>>>>>>>>>> [40003.159522] 0 pages HighMem/MovableOnly
>>>>>>>>>> [40003.160984] 146372 pages reserved
>>>>>>>>>> [40003.162244] [ pid ]   uid  tgid total_vm      rss nr_ptes
>>>>>>>>>> nr_pmds swapents oom_score_adj name
>>>>>>>>>> [40003.165398] [ 2560]     0  2560     2804      181      11
>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>> [40003.168638] [ 3976]     0  3976     2334      123       9
>>>>>>>>>>   3        0             0 dhclient
>>>>>>>>>> [40003.171895] [ 4017]     0  4017    11626       89      23
>>>>>>>>>>   4        0         -1000 auditd
>>>>>>>>>> [40003.175080] [ 4035]     0  4035    61861       99      23
>>>>>>>>>>   3        0             0 rsyslogd
>>>>>>>>>> [40003.178198] [ 4046]     0  4046     3462       98      10
>>>>>>>>>>   3        0             0 irqbalance
>>>>>>>>>> [40003.181559] [ 4052]     0  4052     1096       22       7
>>>>>>>>>>   3        0             0 rngd
>>>>>>>>>> [40003.184683] [ 4067]    32  4067     8815       99      22
>>>>>>>>>>   3        0             0 rpcbind
>>>>>>>>>> [40003.187772] [ 4084]    29  4084     9957      201      24
>>>>>>>>>>   3        0             0 rpc.statd
>>>>>>>>>> [40003.191099] [ 4115]    81  4115     5442       60      15
>>>>>>>>>>   3        0             0 dbus-daemon
>>>>>>>>>> [40003.194438] [ 4333]     0  4333    19452      522      40
>>>>>>>>>>   3        0         -1000 sshd
>>>>>>>>>> [40003.197432] [ 4361]    38  4361     7321      562      19
>>>>>>>>>>   3        0             0 ntpd
>>>>>>>>>> [40003.200609] [ 4376]     0  4376    22238      720      46
>>>>>>>>>>   3        0             0 sendmail
>>>>>>>>>> [40003.203868] [ 4384]    51  4384    20103      674      41
>>>>>>>>>>   3        0             0 sendmail
>>>>>>>>>> [40003.206963] [ 4515]     0  4515     4267       38      13
>>>>>>>>>>   3        0             0 atd
>>>>>>>>>> [40003.210100] [ 6730]     0  6730    29888      547      13
>>>>>>>>>>   3        0             0 crond
>>>>>>>>>> [40003.213267] [13533]   497 13533 47235415  7455314   36074
>>>>>>>>>> 167        0             0 java
>>>>>>>>>> [40003.216364] [13674]   498 13674    49154     3168      51
>>>>>>>>>>   3        0             0 supervisord
>>>>>>>>>> [40003.219721] [13680]   498 13680    51046     5350      69
>>>>>>>>>>   3        0             0 python
>>>>>>>>>> [40003.222908] [13682]   498 13682    36172     5602      75
>>>>>>>>>>   3        0             0 python
>>>>>>>>>> [40003.225952] [13683]   498 13683    32633     5319      68
>>>>>>>>>>   3        0             0 python
>>>>>>>>>> [40003.229108] [13684]   498 13684    29577     5003      63
>>>>>>>>>>   3        0             0 python
>>>>>>>>>> [40003.232263] [13719]   498 13719  1035920    41287     234
>>>>>>>>>>   8        0             0 java
>>>>>>>>>> [40003.235287] [13753]   498 13753    34605     5645      70
>>>>>>>>>>   3        0             0 python
>>>>>>>>>> [40003.238322] [14143]     0 14143     1615      420       9
>>>>>>>>>>   3        0             0 agetty
>>>>>>>>>> [40003.241582] [14145]     0 14145     1078      377       8
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.244752] [14147]     0 14147     1078      354       8
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.247833] [14149]     0 14149     1078      373       8
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.251193] [14151]     0 14151     1078      367       7
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.254342] [14153]     0 14153     1078      348       8
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.257443] [14154]     0 14154     2803      182      10
>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>> [40003.260688] [14155]     0 14155     2803      182      10
>>>>>>>>>>   3        0         -1000 udevd
>>>>>>>>>> [40003.263782] [14157]     0 14157     1078      369       8
>>>>>>>>>>   3        0             0 mingetty
>>>>>>>>>> [40003.266895] Out of memory: Kill process 13533 (java) score 970
>>>>>>>>>> or sacrifice child
>>>>>>>>>> [40003.269702] Killed process 13533 (java) total-vm:188941660kB,
>>>>>>>>>> anon-rss:29710828kB, file-rss:110428kB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>