Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2015/12/11 13:43:11 UTC

[jira] [Comment Edited] (CASSANDRA-10797) Bootstrap new node fails with OOM when streaming nodes contain thousands of sstables

    [ https://issues.apache.org/jira/browse/CASSANDRA-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052692#comment-15052692 ] 

Paulo Motta edited comment on CASSANDRA-10797 at 12/11/15 12:43 PM:
--------------------------------------------------------------------

As mentioned before, I was able to reproduce the OOM with 1000 small sstables and a 50M heap. I attached a [ccm cluster|https://issues.apache.org/jira/secure/attachment/12777032/dtest.tar.gz] with 2 nodes. To reproduce, extract {{dtest.tar.gz}} into the {{~/.ccm}} folder and update the following properties in {{dtest/node*/conf/cassandra.yaml}} to match your local directories: {{commitlog_directory}}, {{data_file_directories}} and {{saved_caches_directory}}. After that, run the following commands:
{noformat}
ccm switch dtest
ccm node1 start
sleep 10
ccm node2 start  # will throw OOM
{noformat}

The main problem is that all {{SSTableWriter}}s remain open until the end of the stream receive task, and these objects are quite large, since the indexes and stats they hold in memory are only written to disk when the {{SSTableWriter}} is closed.
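To make the failure mode concrete, below is a minimal sketch of that pattern, with simplified stand-in types (these are not the actual streaming classes): every received file leaves an open writer behind, and each writer's buffers are only released at the very end of the task.
{noformat}
import java.util.ArrayList;
import java.util.List;

// Stand-in for SSTableWriter: the index/stats it buffers in memory are only
// flushed to disk and released when close() is called.
class WriterStub
{
    final byte[] indexAndStats = new byte[1 << 20]; // per-writer heap cost
    void close() { /* flush index/stats to disk, release buffers */ }
}

class ReceiveTaskSketch
{
    private final List<WriterStub> openWriters = new ArrayList<>();

    // pre-patch behavior: one open writer retained per received sstable
    void onFileReceived(WriterStub writer)
    {
        openWriters.add(writer); // with thousands of incoming files, this accumulation is what OOMs
    }

    void onTaskComplete()
    {
        for (WriterStub w : openWriters)
            w.close(); // memory is only released here, at the end of the task
        openWriters.clear();
    }
}
{noformat}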

Before CASSANDRA-6503, {{SSTableWriter}}s were closed as soon as their files were received, and the stream receive task kept only the {{SSTableReader}}s, which have a much smaller memory footprint. The main reason to defer closing the {{SSTableWriter}}s to the end of the stream receive task was to keep the sstables temporary (with the {{-tmp}} infix), preventing stale sstables from reappearing if the machine is restarted after a failed repair session. A discussed alternative was to close each {{SSTableWriter}} without removing the {{-tmp}} infix and perform an atomic rename at the end of the stream task. However, this alternative was disregarded because the {{SSTableReader}} would need to be closed and reopened to perform the atomic rename on non-POSIX systems such as Windows.

CASSANDRA-6503 also introduced the {{StreamLockFile}} to remove already-closed {{SSTableWriter}}s if the node goes down before these files are processed at the end of the stream receive task. The proposed solution basically returns to the previous behavior of closing {{SSTableWriter}}s as soon as their files are received, while adding the already-closed-but-not-yet-live files to the {{StreamLockFile}}. As soon as the sstables are added to the data tracker, the {{StreamLockFile}} is removed. If the stream session fails before that, the already-closed-but-not-yet-live sstables are cleaned up. If there is a failure while adding files to the data tracker, only the files that were not yet added are removed, since the files already added are live. If the node goes down during a stream session, the already-closed-but-not-yet-live sstables listed in the {{StreamLockFile}} are removed on the next startup, as is done today.
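A rough sketch of this lifecycle, assuming a simplified lock file and tracker API (the {{append}}/{{cleanup}}/{{delete}} names match the operations mentioned below, but the types are illustrative stand-ins, not the actual implementation):
{noformat}
import java.util.ArrayList;
import java.util.List;

class ProposedFlowSketch
{
    // simplified stand-in for the StreamLockFile operations discussed here
    interface LockFile
    {
        void append(String sstable); // record a closed-but-not-yet-live sstable
        void cleanup();              // delete every sstable still listed
        void delete();               // discard the lock file itself
    }

    interface DataTracker { void addSSTables(List<String> sstables); }

    void streamReceiveTask(LockFile lock, DataTracker tracker, List<String> incoming)
    {
        List<String> received = new ArrayList<>();
        try
        {
            for (String sstable : incoming)
            {
                // the writer is closed as soon as its file is received,
                // releasing its memory; the sstable is tracked as not yet live
                lock.append(sstable);
                received.add(sstable);
            }
            tracker.addSSTables(received); // sstables become live
            lock.delete();                 // safe to drop the lock file now
        }
        catch (RuntimeException e)
        {
            // simplified: the real proposal removes only the files that were
            // not yet added to the tracker, since the added ones are already live
            lock.cleanup();
            throw e;
        }
    }
}
{noformat}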

Since the {{StreamLockFile}} is a much more critical component with this approach, I added unit tests to verify that {{append}}, {{cleanup}}, {{skip}} and {{delete}} work correctly. We also need to ignore sstables that are listed in a {{StreamLockFile}} during {{nodetool refresh}}; I will do that after a first review, if this approach is validated.
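For illustration, here is a sketch of the kind of contract such a unit test could assert, written against a toy lock file rather than the actual {{StreamLockFile}}:
{noformat}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

public class LockFileContractTest
{
    // toy substitute for StreamLockFile, just to express the contract
    static class ToyLockFile
    {
        private final List<File> notYetLive = new ArrayList<>();
        void append(File sstable) { notYetLive.add(sstable); }
        void cleanup() { for (File f : notYetLive) f.delete(); notYetLive.clear(); }
    }

    @Test
    public void appendedFilesAreRemovedByCleanup() throws Exception
    {
        File sstable = File.createTempFile("streamed", "-Data.db");
        ToyLockFile lock = new ToyLockFile();

        lock.append(sstable);
        assertTrue(sstable.exists());  // closed but not yet live

        lock.cleanup();                // simulate a failed stream session
        assertFalse(sstable.exists()); // stale sstable cannot reappear
    }
}
{noformat}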

Below are some test results with and without the patch, under constrained (50M heap) and unconstrained (500M heap) memory.

|| ||unpatched||patched||
||constrained|!10797-nonpatched.png!|!10797-patched.png!|
||unconstrained|!10798-nonpatched-500M.png!|!10798-patched-500M.png!|

In the constrained case, the unpatched version OOMed soon after starting bootstrap, while the patched version finished bootstrap successfully. In the unconstrained case, the memory footprint is roughly 1/3 to 1/2 smaller; the difference would probably be much larger with large sstables.

I will provide 2.2+ versions after review.

> Bootstrap new node fails with OOM when streaming nodes contain thousands of sstables
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10797
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10797
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.8.621 w/G1GC
>            Reporter: Jose Martinez Poblete
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>         Attachments: 10797-nonpatched.png, 10797-patched.png, 10798-nonpatched-500M.png, 10798-patched-500M.png, 112415_system.log, Heapdump_OOM.zip, Screen Shot 2015-12-01 at 7.34.40 PM.png, dtest.tar.gz
>
>
> When adding a new node to an existing DC, it runs OOM after 25-45 minutes.
> Upon reviewing the heap dump, it was found that the sending nodes are streaming thousands of sstables, which in turn blows up the bootstrapping node's heap.
> {noformat}
> ERROR [RMI Scheduler(0)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [STREAM-IN-/173.36.28.148] 2015-11-24 10:10:44,585 StreamSession.java:502 - [Stream #0bb13f50-92cb-11e5-bc8d-f53b7528ffb4] Streaming error occurred
> java.lang.IllegalStateException: Shutdown in progress
>         at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) ~[na:1.8.0_65]
>         at java.lang.Runtime.removeShutdownHook(Runtime.java:239) ~[na:1.8.0_65]
>         at org.apache.cassandra.service.StorageService.removeShutdownHook(StorageService.java:747) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector$Killer.killCurrentJVM(JVMStabilityInspector.java:95) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector.inspectThrowable(JVMStabilityInspector.java:64) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:66) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
> ERROR [RMI TCP Connection(idle)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [OptionalTasks:1] 2015-11-24 10:10:44,585 CassandraDaemon.java:223 - Exception in thread Thread[OptionalTasks:1,5,main]
> java.lang.IllegalStateException: Shutdown in progress
> {noformat}
> Attached is the Eclipse MAT report as a zipped web page


