Posted to commits@cassandra.apache.org by "Thomas Steinmaurer (Jira)" <ji...@apache.org> on 2019/11/06 22:19:00 UTC

[jira] [Comment Edited] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster

    [ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968762#comment-16968762 ] 

Thomas Steinmaurer edited comment on CASSANDRA-15400 at 11/6/19 10:18 PM:
--------------------------------------------------------------------------

[~marcuse], the data model started out with Astyanax/Thrift and was later moved to pure CQL3 access (without a real data migration). We still use our own application-side serializer framework, which works with byte buffers, hence the BLOBs on the data model side.

Our high-volume CF/table (usually > 1 TByte per node, RF=3) looks as follows. According to our per-CF, JMX-based self-monitoring, this is also where we see most of the increase in pending compaction tasks:
{noformat}
CREATE TABLE ks.cf1 (
    k blob,
    n blob,
    v blob,
    PRIMARY KEY (k, n)
) WITH COMPACT STORAGE
...
;
{noformat}
We also tend to have single partitions larger than 100 MByte, visible e.g. in the corresponding compaction entries in the Cassandra log, but none of this was a real problem in practice with Cassandra 2.1 and a heap of Xms/Xmx12G and Xmn3G.

A few additional thoughts:
 * Likely the Cassandra node uses most of its compaction threads (4 in this scenario, on the m5.2xlarge instance type) for larger compactions of streamed data while in UJ, leaving less room for compacting live data / actual writes. Once in UN and serving read requests, it then has to access many more small SSTables (it looks like we have/had plenty in the 10-50 MByte range)
 * Is anything known in Cassandra 3.0 that might result in streaming more data from other nodes than in 2.1, and thus in more compaction work for newly joined nodes?
 * Is anything known in Cassandra 3.0 that results in more frequent memtable flushes than in 2.1, again resulting in more compaction work?
 * Coming back to a single {{BigTableReader}} instance: did anything change regarding the 1 MByte byte buffer pre-allocation in {{StatsMetadata}} for each of the data members {{minClusteringValues}} and {{maxClusteringValues}}, as shown in the hprof? It looks to me like we potentially waste quite some on-heap memory here
 !cassandra_hprof_statsmetadata.png|width=800!
 * Is {{StatsMetadata}} purely on-heap? Or is it somehow pulled from off-heap first, resulting in the 1 MByte allocation? This reminds me a bit of the NIO cache buffer bug (https://support.datastax.com/hc/en-us/articles/360000863663-JVM-OOM-direct-buffer-errors-affected-by-unlimited-java-nio-cache), where the recommendation is to set the limit to exactly the number (-Djdk.nio.maxCachedBufferSize=1048576) that we see in the hprof for the on-heap byte buffer

The number of compaction threads and the compaction throttling were unchanged during the upgrade from 2.1 to 3.0 and, if memory serves me well, we should see improved compaction throughput in 3.0 with the same throttling settings anyway.
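
For readers who want to track the same signal, here is a minimal sketch of what a per-CF JMX probe for pending compactions might look like. It is not our actual monitoring code. The class name {{PendingCompactionsProbe}} and its methods are hypothetical, and the MBean name pattern (domain {{org.apache.cassandra.metrics}}, {{type=ColumnFamily}}, metric {{PendingCompactions}} exposed as attribute {{Value}}) is an assumption that should be verified against the MBeans the node really exposes, e.g. with jconsole.

```java
import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of Cassandra.
final class PendingCompactionsProbe {
    // Assumed metrics domain; check it against your Cassandra version.
    static final String DOMAIN = "org.apache.cassandra.metrics";

    /** Builds the (assumed) MBean name of the per-table PendingCompactions gauge. */
    static ObjectName pendingCompactionsName(String keyspace, String table) {
        Hashtable<String, String> keys = new Hashtable<>();
        keys.put("type", "ColumnFamily");   // assumption: per-CF metric type
        keys.put("keyspace", keyspace);
        keys.put("scope", table);
        keys.put("name", "PendingCompactions");
        try {
            return new ObjectName(DOMAIN, keys);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid keyspace or table name", e);
        }
    }

    /** Reads the gauge over an existing JMX connection (attribute name "Value" is assumed). */
    static long readPendingCompactions(MBeanServerConnection conn, String keyspace, String table)
            throws Exception {
        Object value = conn.getAttribute(pendingCompactionsName(keyspace, table), "Value");
        return ((Number) value).longValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: PendingCompactionsProbe <host:jmxPort> <keyspace> <table>");
            return;
        }
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
        // JMXConnector is Closeable, so try-with-resources closes the connection.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            long pending = readPendingCompactions(
                    connector.getMBeanServerConnection(), args[1], args[2]);
            System.out.println(args[1] + "." + args[2] + " pending compactions: " + pending);
        }
    }
}
```

It could be run, for example, with the arguments {{localhost:7199 ks cf1}}, assuming the node exposes JMX on that port without authentication.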


> Cassandra 3.0.18 went OOM several hours after joining a cluster
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-15400
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>            Assignee: Blake Eggleston
>            Priority: Normal
>         Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, cassandra_jvm_metrics.png, cassandra_operationcount.png, cassandra_sstables_pending_compactions.png
>
>
> We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have hit an OOM twice with 3.0.18 on newly added nodes, several hours after they had successfully bootstrapped and joined an existing cluster.
> Running in AWS:
> * m5.2xlarge, EBS SSD (gp2)
> * Xms/Xmx12G, Xmn3G, CMS GC
> * 4 compaction threads, throttling set to 32 MB/s
> What we see is a steady increase in the OLD gen over many hours.
> !cassandra_jvm_metrics.png!
> * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00
> * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00, at which point it also started serving client read requests
> !cassandra_operationcount.png!
> Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased.
> We see a correlation with the increased number of SSTables and pending compactions.
> !cassandra_sstables_pending_compactions.png!
> This continued until we reached the OOM at some point during the night on Nov 1. After a Cassandra restart (the metric gap in the chart above), the number of SSTables + pending compactions is still high, but we have not faced memory troubles since then.
> This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained heap in total.
> !cassandra_hprof_dominator_classes.png!
> Taking a closer look at a single object instance, it seems that each instance is ~ 2MByte in size.
> !cassandra_hprof_bigtablereader_statsmetadata.png!
> This is due to 2 pre-allocated byte buffers (highlighted in the screenshot above) at 1 MByte each.
> We have been running with 2.1.18 for > 3 years and I can't remember dealing with such an OOM in the context of extending a cluster.
> While the MAT screenshots above are from our production cluster, we can partly reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share an hprof from this non-prod environment if needed.
> Thanks a lot.
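
As a rough cross-check of the figures in the issue description, the sketch below multiplies out the pre-allocated buffers and compares the result with the old generation size implied by the heap settings. It is only an order-of-magnitude illustration. The class name {{BigTableReaderHeapEstimate}} is hypothetical, the instance count is the approximate "~ 5K" from the heap dump, and binary units (1 MByte = 1048576 bytes) are assumed. The old gen is assumed to be simply Xmx minus Xmn.

```java
// Hypothetical back-of-the-envelope helper, not part of Cassandra.
final class BigTableReaderHeapEstimate {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    /** Bytes held by the pre-allocated buffers across all reader instances. */
    static long totalBufferBytes(long readers, int buffersPerReader, long bytesPerBuffer) {
        return readers * buffersPerReader * bytesPerBuffer;
    }

    /** Assumed old gen capacity: total heap (-Xmx) minus young gen (-Xmn). */
    static long oldGenBytes(long xmxBytes, long xmnBytes) {
        return xmxBytes - xmnBytes;
    }

    static double toGiB(long bytes) {
        return (double) bytes / GIB;
    }

    public static void main(String[] args) {
        long readers = 5_000;                               // "~ 5K BigTableReader instances" (approximate)
        long buffers = totalBufferBytes(readers, 2, MIB);   // 2 buffers at 1 MByte each
        long oldGen = oldGenBytes(12 * GIB, 3 * GIB);       // Xmx12G, Xmn3G

        System.out.printf("pre-allocated buffers: %.2f GiB%n", toGiB(buffers));
        System.out.printf("old gen capacity:      %.2f GiB%n", toGiB(oldGen));
        System.out.printf("buffers / old gen:     %.0f%%%n", 100.0 * buffers / oldGen);
    }
}
```

With these inputs the buffers alone come to about 10,000 MByte, the same order of magnitude as both the reported ~ 8.7GByte retained heap and the old gen capacity. That is consistent with the old gen filling up as the number of open SSTables grows.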


