Posted to commits@cassandra.apache.org by "Jose Martinez Poblete (JIRA)" <ji...@apache.org> on 2014/11/11 23:30:35 UTC

[jira] [Issue Comment Deleted] (CASSANDRA-8295) Cassandra runs OOM @ java.util.concurrent.ConcurrentSkipListMap$HeadIndex

     [ https://issues.apache.org/jira/browse/CASSANDRA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jose Martinez Poblete updated CASSANDRA-8295:
---------------------------------------------
    Comment: was deleted

(was: More info from MAT

{noformat}
Class Name                                              |     Objects |  Shallow Heap
-------------------------------------------------------------------------------------
java.nio.HeapByteBuffer                                 |  73,845,620 | 3,544,589,760
edu.stanford.ppl.concurrent.SnapTreeMap$Node            |  34,614,044 | 1,661,474,112
byte[]                                                  |   3,969,475 | 1,510,362,528
org.apache.cassandra.db.Column                          |  34,614,043 | 1,107,649,376
edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch |     411,924 |    39,544,704
java.nio.ByteBuffer[]                                   |     823,848 |    30,913,568
long[]                                                  |     411,924 |    22,819,304
edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder      |     411,924 |    19,772,352
org.apache.cassandra.db.RangeTombstoneList              |     411,924 |    16,476,960
int[]                                                   |     411,924 |    15,456,784
edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch    |     411,924 |    13,181,568
edu.stanford.ppl.concurrent.SnapTreeMap                 |     411,924 |    13,181,568
java.util.concurrent.atomic.AtomicReference             |     823,848 |    13,181,568
java.util.concurrent.ConcurrentSkipListMap$Node         |     411,929 |     9,886,296
org.apache.cassandra.db.DecoratedKey                    |     411,928 |     9,886,272
java.lang.Long                                          |     411,928 |     9,886,272
org.apache.cassandra.db.AtomicSortedColumns             |     411,924 |     9,886,176
org.apache.cassandra.db.AtomicSortedColumns$Holder      |     411,924 |     9,886,176
org.apache.cassandra.db.DeletionInfo                    |     411,924 |     9,886,176
org.apache.cassandra.dht.LongToken                      |     411,928 |     6,590,848
edu.stanford.ppl.concurrent.SnapTreeMap$COWMgr          |     411,924 |     6,590,784
java.util.concurrent.ConcurrentSkipListMap$Index        |     207,065 |     4,969,560
java.util.concurrent.ConcurrentSkipListMap$HeadIndex    |          16 |           512
org.apache.cassandra.db.DeletedColumn                   |           1 |            32
-------------------------------------------------------------------------------------
Total: 24 entries                                       | 155,076,837 | 8,086,073,256
{noformat})

> Cassandra runs OOM @ java.util.concurrent.ConcurrentSkipListMap$HeadIndex
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8295
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8295
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.3 Cassandra 2.0.11.82
>            Reporter: Jose Martinez Poblete
>         Attachments: alln01-ats-cas3.cassandra.yaml, output.tgz, system.tgz, system.tgz.1, system.tgz.2, system.tgz.3
>
>
> Customer runs a 3-node cluster.
> Their dataset is less than 1 TB, and during data load one of the nodes enters a GC death spiral:
> {noformat}
>  INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:40:58,486 GCInspector.java (line 116) GC for ParNew: 442 ms for 2 collections, 6079570032 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:40:58,487 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 7351 ms for 2 collections, 6084678280 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:01,836 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 603 ms for 1 collections, 7132546096 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:09,626 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 761 ms for 1 collections, 7286946984 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:15,265 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 703 ms for 1 collections, 7251213520 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:25,027 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 1205 ms for 1 collections, 6507586104 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:41,374 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 13835 ms for 3 collections, 6514187192 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-07 23:41:54,137 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 6834 ms for 2 collections, 6521656200 used; max is 8375238656
> ...
>  INFO [ScheduledTasks:1] 2014-11-08 12:13:11,086 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 43967 ms for 2 collections, 8368777672 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:14:14,151 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 63968 ms for 3 collections, 8369623824 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:14:55,643 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 41307 ms for 2 collections, 8370115376 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 309634 ms for 15 collections, 8374994928 used; max is 8375238656
>  INFO [ScheduledTasks:1] 2014-11-08 13:07:33,617 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2681100 ms for 143 collections, 8347631560 used; max is 8375238656
> {noformat} 
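
A minimal sketch for reading the GCInspector lines quoted above: it extracts the collector name, pause time, and heap occupancy (used / max) from each line. The regular expression is written only against the line format shown in this report, and the names `GC_LINE`, `parse_gc_line`, `summarize`, and `SAMPLE` are hypothetical helpers, not part of Cassandra.

```python
import re

# Pattern for lines such as:
#  INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116)
#  GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; max is 8375238656
# Assumption: every GCInspector line follows the layout seen in this report.
GC_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ GCInspector\.java .*?"
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Return a dict for one GCInspector line, or None if it does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    used, max_heap = int(m.group("used")), int(m.group("max"))
    return {
        "timestamp": m.group("ts"),
        "collector": m.group("collector"),
        "pause_ms": int(m.group("ms")),
        "collections": int(m.group("count")),
        "used": used,
        "max": max_heap,
        "occupancy": used / max_heap,  # fraction of max heap still in use
    }


def summarize(lines):
    """Parse every matching line and skip the rest."""
    return [rec for rec in (parse_gc_line(l) for l in lines) if rec is not None]


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    " INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; max is 8375238656",
    " INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 309634 ms for 15 collections, 8374994928 used; max is 8375238656",
]

if __name__ == "__main__":
    for rec in summarize(SAMPLE):
        print(f"{rec['timestamp']}  {rec['collector']:<20} "
              f"{rec['pause_ms']:>8} ms  heap {rec['occupancy']:.2%}")
```

Run against the full system.log, a listing like this makes it easier to see when occupancy stops dropping after ConcurrentMarkSweep cycles, which is the pattern visible in the excerpt.
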
> Their application waits 1 minute before retrying when a timeout is returned.
> This is what we find in their heap dumps:
> {noformat}
> Class Name                                                                                                                                                                                                                                                                                               | Shallow Heap | Retained Heap | Percentage
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> org.apache.cassandra.db.Memtable @ 0x773f52f80                                                                                                                                                                                                                                                           |           72 | 8,086,073,504 |     96.66%
> |- java.util.concurrent.ConcurrentSkipListMap @ 0x724508fe8                                                                                                                                                                                                                                              |           48 | 8,086,073,320 |     96.66%
> |  |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x64f9219a0                                                                                                                                                                                                                                 |           32 | 8,086,073,256 |     96.66%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x614b081a8                                                                                                                                                                                                                                   |           24 |    16,230,976 |      0.19%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7da171948                                                                                                                                                                                                                                   |           24 |     4,922,288 |      0.06%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7f4518a80                                                                                                                                                                                                                                   |           24 |     4,405,496 |      0.05%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x611d69d10                                                                                                                                                                                                                                   |           24 |     3,737,672 |      0.04%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x71cd2fae8                                                                                                                                                                                                                                   |           24 |     2,921,048 |      0.03%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x728faed50                                                                                                                                                                                                                              |           32 |     2,012,592 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x6387eb950                                                                                                                                                                                                                                   |           24 |     1,641,696 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x727f474f0                                                                                                                                                                                                                                   |           24 |     1,328,936 |      0.02%
> |  |  |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x70d7a02b0                                                                                                                                                                                                                                   |           24 |     1,050,624 |      0.01%
> |  |  |- byte[1048576] @ 0x7d87873d8  .........8.........CS.l`...attributes...slot..............attributes...runtime......A..<x.........C.......attributes...procgid.87.....CS.`....attributes...bflush.00.....CV......attributes...username........uV....server.f1432541.........8...server......A..<...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x60ab7b920  .....7...p...attributes...tottime....../..%....area......0.......attributes...lineid.56.....7.i.....attributes...tottime.156258924.....0B)\....container.4...../.......server....,PTXCALsdihqprod1\sdihqprod1...../.......machine.fxcdom1.....7.i.....attributes...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x609fb54f8  .....E.......attributes...lineid.901137423.....E.......attributes...testr1.1413.....E.......attributes...testr2.M393B1K70QB0-YK02014-01-03 06:46:31.....E.......attributes...tenum1name.EFSTLOOP.....CV......attributes...numunits.1.....E.......attributes...pa...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x60a0b5508  .....E.z.....area......?.......attributes...labelnum.SYSFA.....0"U.....attributes...testr1name.D75165799...../..^....attributes...crc.Hexload_Bootloader.....E.TR....machine......0.......attributes...bflush....../..&....attributes...majline....../._.P...att...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d8f5e2b8  ......B9.....machine.solfr5.......L.....attributes...runtime.146.............attributes...tottime.109.......t.h...attributes...bmap.0......B9.....uuttype.VIP2-40=.......L.....attributes...cpptimeid.2006-04-11 10:53:48.............attributes...partnum.73-91...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d905e2c8  .....E.|.x...attributes...runtime.310.....E.P.0...attributes...partnum.15-13637-02.....E./<....area.SYSFA.....E./<....passfail.S.....E.|.x...attributes...testr1.1413.....E.P.0...attributes...partnum2.15-13637-02.....E./<....container....T.....E./<....attri...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d915e2d8  ...../..l........../..l....server.sdihqprod1\sdihqprod1...../..l....machine.fxcdom1...../..l....uuttype.73-12304-03...../..l....area.PASTE...../..l....passfail.P...../..l....container........../..l....attributes...majline.0...../..l....attributes...subslot...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7d925e2e8  .............attributes...tottime.42........KH...attributes...testtime.........3....uuttype.0.............attributes...runtime..............attributes...procgid.73-9341-021417817........3....area.PASTE.............attributes...test.PASSED........3....passf...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be7473f0  .....=..Jh.........=...(...attributes...runtime.f6f1298f-830f-47f4-b1dd-1adb07b99ff9653.....=..*(...attributes...numunits.1.....=..Jh...server......=.._....attributes...testr3name.RCDN9HQPROD1\RCDN9HQPROD1CPPVersion:3.6.2803.0.....=..*(...attributes...test...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be847800  .........0...passfail.P........ (...attributes...bflush.0.......w.....attributes...tottime.1161.....A.b.(...area.SYSVF.......kH....attributes...test..............attributes...runtime.PASSED.....A.zl....attributes...procgid.2.........0...container.............|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7be949070  .....=...0...attributes...pcid......=..>....attributes...cpptimeid.6cc40f78-9525-4488-909f-2247d9537cf82013-04-04 19:24:23.....=.Z.....attributes...runtime.0.....=...0...attributes...testr3name.CPPVersion:3.6.2803.0.....=.............=...0...attributes...p...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bea4a8e0  .....>{A0....uuttype.FJZPROD1\FJZPROD1.....>z.Mp...attributes...pcid......Ct..(...attributes...lineid......=..n8...container......=.oE..........4B......machine.F2049802CBLSTB-4044066-K9fxhmcekit2.....=.p(h...attributes...proctime......>z..`...machine.........|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7beb4a8f0  .....A../....attributes...partnum2......B'......attributes...username....D.....B'.L....area.f1303257.....A...P...area.74-8071-01F118190965553.....A.......server.PCBDLSYSPM.......$.....attributes...runtime......>{r+....attributes...cpptimeid......B.B. ...co...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bec4a900  .....=..;x...attributes...username.tczpawe73-100074-01.....=..;x...attributes...slot.0.....=.......area.ASSY.....=..;x...attributes...lineid......=.......passfail.0P.....=.......container..........=..;x...attributes...numunits.1.....=.......attributes...pa...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x7bed55ff0  .....=oC/....machine......"..Q....uuttype......"...`...attributes...parentsernum.73-8479-02FCZ133171DPfxcestgfqa1....."`..x...attributes...test......=l.2p...attributes...tenum3.8242009070919300730FOC13283D6A....."..Q....area......"...(...attributes...bflus...|    1,048,592 |     1,048,592 |      0.01%
> |  |  |- byte[1048576] @ 0x61cf45088  .....CSaL....server.FXCPROD1\FXCPROD1.....CW......passfail.P.....CSr.....attributes...runtime.50.....CSaL....machine.foxchict217.....CW......container..........CSaL....uuttype......CW......attributes...username.73-13315-03xzhang.....CSr.....attributes...te...|    1,048,592 |     1,048,592 |      0.01%
> |  |  '- Total: 25 of 166,289 entries; 166,264 more                                                                                                                                                                                                                                                      |              |               |           
> |  |- java.util.concurrent.ConcurrentSkipListMap$EntrySet @ 0x72541dc58                                                                                                                                                                                                                                  |           16 |            16 |      0.00%
> |  '- Total: 2 entries                                                                                                                                                                                                                                                                                   |              |               |           
> |- org.github.jamm.MemoryMeter @ 0x72541db50                                                                                                                                                                                                                                                             |           24 |            40 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db68                                                                                                                                                                                                                                                  |           24 |            24 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db80                                                                                                                                                                                                                                                  |           24 |            24 |      0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db38                                                                                                                                                                                                                                                  |           24 |            24 |      0.00%
> '- Total: 5 entries                                                                                                                                                                                                                                                                                      |              |               |           
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> {noformat}
> They are using the defaults in cassandra.yaml, which means sstables should not use that much heap. Setting the following has been of no use:
> {noformat}
> memtable_total_space_in_mb: 2000
> memtable_flush_queue_size: 1
> {noformat}
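
To put the numbers in this report side by side, the rough calculation below compares the retained heap of the single Memtable from the heap dump with the `memtable_total_space_in_mb` value that was tried. It assumes the setting is counted in units of 2^20 bytes; `memtable_overshoot` is a hypothetical helper name, and all figures are copied from the report.

```python
MIB = 1024 * 1024  # assumption: the *_in_mb setting is counted in 2**20-byte units

# Values copied from the report.
memtable_total_space_in_mb = 2000        # setting tried by the reporter
memtable_retained_bytes = 8_086_073_504  # retained heap of org.apache.cassandra.db.Memtable
max_heap_bytes = 8_375_238_656           # "max is ..." in the GCInspector lines


def memtable_overshoot(retained_bytes, limit_mb):
    """Return how many times the retained size exceeds the configured limit."""
    return retained_bytes / (limit_mb * MIB)


ratio = memtable_overshoot(memtable_retained_bytes, memtable_total_space_in_mb)
share = memtable_retained_bytes / max_heap_bytes

print(f"retained / configured limit: {ratio:.2f}x")
print(f"retained / max heap:         {share:.2%}")
```

Under these assumptions, one memtable retains several times the configured total, which is consistent with the observation that the setting had no visible effect.
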



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)