Posted to user@cassandra.apache.org by Michael Fong <mi...@ruckuswireless.com> on 2016/04/20 04:43:21 UTC

Cassandra 2.0.x OOM during bootstrap

Hi, all,

We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.

After analyzing the heap dump, we found that the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) was filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.

According to documentation found on the internet, the Internal-Response thread pool seems to be related to schema checking. Has anyone encountered a similar issue before?
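To illustrate the pattern we suspect (this is just a self-contained toy program, not Cassandra code): a single slow consumer draining an unbounded executor queue while messages keep arriving, so the queued payloads alone can pin gigabytes of heap. The class name, payload size, and timings below are made up for illustration; run it with a small heap (e.g. -Xmx256m) to see the OutOfMemoryError quickly.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy illustration (not Cassandra code): an unbounded queue in front of a
// single slow consumer retains every queued payload, so heap usage grows
// with the backlog rather than with the work actually completed.
public class UnboundedQueueBacklog {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor stage = new ThreadPoolExecutor(
                1, 1, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());          // unbounded queue

        for (int i = 0; i < 100_000; i++) {
            byte[] payload = new byte[32 * 1024];      // stand-in for one response message
            stage.execute(() -> {
                try {
                    Thread.sleep(10);                  // consumer is slower than the producer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                if (payload.length == 0) throw new AssertionError(); // keep 'payload' referenced
            });
            if (i % 10_000 == 0) {
                long usedMb = (Runtime.getRuntime().totalMemory()
                        - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
                System.out.printf("queued=%d heapUsed=%dMB%n", stage.getQueue().size(), usedMb);
            }
        }
        stage.shutdownNow();                           // discard the backlog so the demo exits
    }
}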

We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!

Sincerely,

Michael Fong

RE: Cassandra 2.0.x OOM during bootstrap

Posted by Michael Fong <mi...@ruckuswireless.com>.
Hi, all,

Here is some more information on what happened before the OOM on the rebooted node in a 2-node test cluster:


1. It seems the schema version changed on the rebooted node after the reboot, i.e.
Before reboot,
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2,
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
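
As far as we understand, the gossiped schema version is essentially a digest of the node's schema content, so any difference in the schema loaded after a restart yields a completely different UUID and makes the peers look mismatched. Below is a rough, hypothetical illustration of that idea only (not Cassandra's actual implementation, which digests the system schema tables rather than CQL strings):

import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Rough illustration only: a schema "version" derived as a content digest,
// so even a tiny difference in the serialized schema produces a completely
// different version UUID, which peers then see as a mismatch.
public class SchemaVersionDigest {
    static UUID versionOf(String serializedSchema) {
        // name-based (MD5) UUID over the schema bytes
        return UUID.nameUUIDFromBytes(serializedSchema.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String before = "CREATE TABLE ks.t (k int PRIMARY KEY, v text);";
        String after  = "CREATE TABLE ks.t (k int PRIMARY KEY, v text, w int);";
        System.out.println("before reboot: " + versionOf(before));
        System.out.println("after  reboot: " + versionOf(after)); // different UUID -> mismatch
    }
}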



2. After the reboot, both nodes repeatedly send MigrationTasks to each other. We suspect this is related to the schema version (digest) mismatch after Node 2 rebooted; a rough sketch of the suspected feedback loop follows the log excerpts below.
Node 2 keeps submitting the migration task to the other node, over 100 times:
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
.....


On the other hand, Node 1 keeps updating its gossip information, and then receives and submits migration tasks:
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
......
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
......
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
.....
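
Putting the two observations together, our working theory is roughly the pattern sketched below. This is a hedged sketch only, not the real MigrationManager code: every gossip- or response-driven callback that sees a remote schema version different from the local one submits another schema pull, so a mismatch that does not converge quickly floods both nodes with migration tasks, and the pull responses then pile up on the Internal-Response stage as described in the first mail. All names below are invented for illustration.

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hedged sketch of the suspected trigger pattern (not the real MigrationManager):
// every gossip- or response-driven callback that notices a remote schema version
// different from the local one queues yet another schema pull, so a mismatch that
// never converges keeps flooding the migration stage.
public class SchemaMismatchLoopSketch {
    private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
    private volatile UUID localSchemaVersion = UUID.randomUUID();

    // Imagined to be called whenever gossip (or a request/response callback)
    // observes an endpoint's application state.
    void maybeScheduleSchemaPull(String endpoint, UUID remoteSchemaVersion) {
        if (!remoteSchemaVersion.equals(localSchemaVersion)) {
            System.out.println("Submitting migration task for " + endpoint);
            migrationStage.execute(() -> requestSchemaFrom(endpoint));
        }
    }

    private void requestSchemaFrom(String endpoint) {
        // In the real system this would send a migration (schema pull) request whose
        // large response lands on the Internal-Response stage; here it is a stub.
    }

    public static void main(String[] args) {
        SchemaMismatchLoopSketch node = new SchemaMismatchLoopSketch();
        UUID remote = UUID.randomUUID(); // never converges in this toy example
        for (int i = 0; i < 5; i++) {
            node.maybeScheduleSchemaPull("/192.168.88.33", remote);
        }
        node.migrationStage.shutdown();
    }
}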

Has anyone experienced this scenario? Thanks in advance!

Sincerely,

Michael Fong

From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
Sent: Wednesday, April 20, 2016 10:43 AM
To: user@cassandra.apache.org; dev@cassandra.apache.org
Subject: Cassandra 2.0.x OOM during bootstrap

Hi, all,

We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.

After analyzing the heap dump, we found that the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) was filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.

According to documentation found on the internet, the Internal-Response thread pool seems to be related to schema checking. Has anyone encountered a similar issue before?

We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!

Sincerely,

Michael Fong
