Posted to dev@cassandra.apache.org by Michael Fong <mi...@ruckuswireless.com> on 2016/05/09 03:48:08 UTC

RE: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Hi, all,


We haven't heard any responses so far, and this issue has troubled us for quite some time. Here is another update:

We have noticed several times that the schema version may change after a migration and a reboot.

Here is the scenario:

1.       Two-node cluster (nodes 1 & 2).

2.       Some schema changes are applied, e.g. creating a few new column families. The cluster waits until both nodes have their schema versions in sync (describe cluster) before moving on; a quick way to check this is sketched right after this list.

3.       Right before node2 is rebooted, the schema version is consistent; however, after node2 reboots and starts serving, the MigrationManager gossips a different schema version.

4.       Afterwards, both nodes keep exchanging schema messages indefinitely until one of the nodes dies.
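
For reference, a minimal way to verify schema agreement from the shell before moving on (a sketch, assuming it is run on one of the nodes with default settings; the queries in the comments go through cqlsh):

# A single entry under "Schema versions" in the output means the nodes agree
nodetool describecluster

# Per-node view via the system tables (run these inside cqlsh):
#   SELECT schema_version FROM system.local;
#   SELECT peer, schema_version FROM system.peers;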

We currently suspect the schema change is caused by replaying old entries in the commit log. We wish to dig further, but need expert help on this.

I don't know if anyone has seen this before, or whether there is anything wrong with our migration flow.

Thanks in advance.

Best regards,


Michael Fong

From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
Sent: Thursday, April 21, 2016 6:41 PM
To: user@cassandra.apache.org; dev@cassandra.apache.org
Subject: RE: Cassandra 2.0.x OOM during bootstrap

Hi, all,

Here is some more information on what happened before the OOM on the rebooted node in a 2-node test cluster:


1.       It seems the schema version has changed on the rebooted node after reboot, i.e.
Before reboot,
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2,
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b



2.       After the reboot, both nodes repeatedly send MigrationTasks to each other - we suspect this is related to the schema version (digest) mismatch after node 2 rebooted:
Node 2 keeps submitting the migration task 100+ times to the other node (a quick way to count these from the logs is sketched after the excerpts below).
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
.....


On the other hand, node 1 keeps updating its gossip information, followed by receiving and submitting migration tasks:
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
......
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
......
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
.....
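
As an aside, a quick way to quantify this storm from each node's log (a sketch; the log path is the packaged-install default and may differ in your deployment):

# Count the schema/migration chatter since the last restart
grep -c "Submitting migration task" /var/log/cassandra/system.log
grep -c "Received migration request" /var/log/cassandra/system.log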

Has anyone experienced this scenario? Thanks in advance!

Sincerely,

Michael Fong

From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
Sent: Wednesday, April 20, 2016 10:43 AM
To: user@cassandra.apache.org; dev@cassandra.apache.org
Subject: Cassandra 2.0.x OOM during bootstrap

Hi, all,

We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.

After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.

According to documentation on the internet, the internal-response thread pool seems to be related to schema checking. Has anyone encountered a similar issue before?
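
For what it is worth, the growth of that pool can also be watched from the shell (a sketch; in 2.0 the pool shows up in tpstats as InternalResponseStage):

# Rising pending counts here line up with the MessageIn build-up seen in the heap dump
nodetool tpstats | grep -i InternalResponse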

We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!

Sincerely,

Michael Fong

RE: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Posted by Michael Fong <mi...@ruckuswireless.com>.
Hi Alain,

Thanks for your reply.

We understand that there is a chance this will be left unresolved, since we are quite far behind the official Cassandra releases.

Here is what we have further found about the OOM issue, which seems to be related to the number of gossip messages accumulated on a live node while it waits to reconnect to the rebooted node. Once that node is rebooted, all the gossip messages flood in, trigger StorageService.onAlive(), and schedule schema pulls on demand. In our case, the schema version is sometimes different after the reboot, and when that happens the schema-exchange storm begins.
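
For instance, the schema version each node currently gossips can be pulled straight from its log to confirm the mismatch (a sketch; again assuming the packaged-install default log path):

# The last such line is the node's current schema version; differing UUIDs
# across nodes after the reboot correspond to the storm described above.
grep "Gossiping my schema version" /var/log/cassandra/system.log | tail -n 1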

Also, thanks for sharing your SOP for stopping a node. Here is our stop procedure (a shell sketch follows the list):
Disable thrift
Disable Binary
Wait 10s
Disable gossip
Drain
Kill <pid>
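
Expressed as shell commands, that sequence is roughly the following (a sketch; the 10-second pause and the pid lookup are assumptions about how the steps above are wired together):

nodetool disablethrift
nodetool disablebinary
sleep 10
nodetool disablegossip
nodetool drain
# <pid> is the Cassandra JVM process id, obtained however the init scripts track it
kill <pid>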

Any thoughts on how this could be further improved?

Thanks!

Sincerely,

Michael Fong


From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: Wednesday, May 11, 2016 10:01 PM
To: user@cassandra.apache.org
Cc: dev@cassandra.apache.org
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Hi Michaels :-),

My guess is this ticket will be closed with a "Won't Fix" resolution.

Cassandra 2.0 is no longer supported and I have seen tickets being rejected like CASSANDRA-10510 <https://issues.apache.org/jira/browse/CASSANDRA-10510>.

Would you like to upgrade to the latest 2.1.x and see if you still have the issue?

About your issue, do you stop your node using a command like the following one?

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool drain && sleep 10 && sudo service cassandra stop

or even flushing:

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool flush && nodetool drain && sleep 10 && sudo service cassandra stop

Are commitlogs empty when you start cassandra?

C*heers,

-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-05-11 5:35 GMT+02:00 Michael Fong <mi...@ruckuswireless.com>:
Hi,

Thanks for your recommendation.
I also opened a ticket to keep track @ https://issues.apache.org/jira/browse/CASSANDRA-11748
Hope this brings it to someone's attention. Thanks.

Sincerely,

Michael Fong

-----Original Message-----
From: Michael Kjellman [mailto:mkjellman@internalcircle.com]
Sent: Monday, May 09, 2016 11:57 AM
To: dev@cassandra.apache.org
Cc: user@cassandra.apache.org
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

I'd recommend you create a JIRA! That way you can get some traction on the issue. Obviously an OOM is never correct, even if your process is wrong in some way!

Best,
kjellman

Sent from my iPhone

> On May 8, 2016, at 8:48 PM, Michael Fong <mi...@ruckuswireless.com> wrote:
>
> Hi, all,
>
>
> Haven't heard any responses so far, and this isue has troubled us for quite some time. Here is another update:
>
> We have noticed several times that The schema version may change after migration and reboot:
>
> Here is the scenario:
>
> 1.       Two node cluster (1 & 2).
>
> 2.       There are some schema changes, i.e. create a few new columnfamily. The cluster will wait until both nodes have schema version in sync (describe cluster) before moving on.
>
> 3.       Right before node2 is rebooted, the schema version is consistent; however, after ndoe2 reboots and starts servicing, the MigrationManager would gossip different schema version.
>
> 4.       Afterwards, both nodes starts exchanging schema  message indefinitely until one of the node dies.
>
> We currently suspect the change of schema is due to replying the old entry in commit log. We wish to continue dig further, but need experts help on this.
>
> I don't know if anyone has seen this before, or if there is anything wrong with our migration flow though..
>
> Thanks in advance.
>
> Best regards,
>
>
> Michael Fong
>
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Thursday, April 21, 2016 6:41 PM
> To: user@cassandra.apache.org; dev@cassandra.apache.org
> Subject: RE: Cassandra 2.0.x OOM during bootstrap
>
> Hi, all,
>
> Here is some more information on before the OOM happened on the rebooted node in a 2-node test cluster:
>
>
> 1.       It seems the schema version has changed on the rebooted node after reboot, i.e.
> Before reboot,
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326
> MigrationManager.java (line 328) Gossiping my schema version
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122
> MigrationManager.java (line 328) Gossiping my schema version
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
>
> After rebooting node 2,
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java
> (line 328) Gossiping my schema version
> f5270873-ba1f-39c7-ab2e-a86db868b09b
>
>
>
> 2.       After reboot, both nods repeatedly send MigrationTask to each other - we suspect it is related to the schema version (Digest) mismatch after Node 2 rebooted:
> The node2  keeps submitting the migration task over 100+ times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
> INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> .....
>
>
> On the otherhand, Node 1 keeps updating its gossip information, followed by receiving and submitting migrationTask afterwards:
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> ......
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> ......
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> .....
>
> Has anyone experienced this scenario? Thanks in advanced!
>
> Sincerely,
>
> Michael Fong
>
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Wednesday, April 20, 2016 10:43 AM
> To: user@cassandra.apache.org; dev@cassandra.apache.org
> Subject: Cassandra 2.0.x OOM during bootstrap
>
> Hi, all,
>
> We have recently encountered a Cassandra OOM issue when Cassandra is brought up sometimes (but not always) in our 4-node cluster test bed.
>
> After analyzing the heap dump, we could find the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) is filled with thounds of 'org.apache.cassandra.net.MessageIn' objects, and occupy > 2 gigabytes of heap memory.
>
> According to the documents on internet, it seems internal-response thread pool is somewhat related to schema-checking. Has anyone encountered similar issue before?
>
> We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
>
> Sincerely,
>
> Michael Fong


Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Michaels :-),

My guess is this ticket will be closed with a "Won't Fix" resolution.

Cassandra 2.0 is no longer supported and I have seen tickets being rejected, like CASSANDRA-10510 <https://issues.apache.org/jira/browse/CASSANDRA-10510>.

Would you like to upgrade to the latest 2.1.x and see if you still have the issue?

About your issue, do you stop your node using a command like the following
one?

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool drain && sleep 10 && sudo service cassandra stop

or even flushing:

nodetool disablethrift && nodetool disablebinary && sleep 5 && nodetool disablegossip && sleep 10 && nodetool flush && nodetool drain && sleep 10 && sudo service cassandra stop

Are commitlogs empty when you start cassandra?
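
A quick way to check, as a sketch assuming the default layout (the directory is whatever commitlog_directory points to in cassandra.yaml):

# Inspect the commit log segments present on disk before restarting the node
ls -lh /var/lib/cassandra/commitlog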

C*heers,

-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-05-11 5:35 GMT+02:00 Michael Fong <mi...@ruckuswireless.com>:

> Hi,
>
> Thanks for your recommendation.
> I also opened a ticket to keep track @
> https://issues.apache.org/jira/browse/CASSANDRA-11748
> Hope this could brought someone's attention to take a look. Thanks.
>
> Sincerely,
>
> Michael Fong
>
> -----Original Message-----
> From: Michael Kjellman [mailto:mkjellman@internalcircle.com]
> Sent: Monday, May 09, 2016 11:57 AM
> To: dev@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra 2.0.x OOM during startsup - schema version
> inconsistency after reboot
>
> I'd recommend you create a JIRA! That way you can get some traction on the
> issue. Obviously an OOM is never correct, even if your process is wrong in
> some way!
>
> Best,
> kjellman
>
> Sent from my iPhone
>
> > On May 8, 2016, at 8:48 PM, Michael Fong <
> michael.fong@ruckuswireless.com> wrote:
> >
> > Hi, all,
> >
> >
> > Haven't heard any responses so far, and this isue has troubled us for
> quite some time. Here is another update:
> >
> > We have noticed several times that The schema version may change after
> migration and reboot:
> >
> > Here is the scenario:
> >
> > 1.       Two node cluster (1 & 2).
> >
> > 2.       There are some schema changes, i.e. create a few new
> columnfamily. The cluster will wait until both nodes have schema version in
> sync (describe cluster) before moving on.
> >
> > 3.       Right before node2 is rebooted, the schema version is
> consistent; however, after ndoe2 reboots and starts servicing, the
> MigrationManager would gossip different schema version.
> >
> > 4.       Afterwards, both nodes starts exchanging schema  message
> indefinitely until one of the node dies.
> >
> > We currently suspect the change of schema is due to replying the old
> entry in commit log. We wish to continue dig further, but need experts help
> on this.
> >
> > I don't know if anyone has seen this before, or if there is anything
> wrong with our migration flow though..
> >
> > Thanks in advance.
> >
> > Best regards,
> >
> >
> > Michael Fong
> >
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Thursday, April 21, 2016 6:41 PM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: RE: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > Here is some more information on before the OOM happened on the rebooted
> node in a 2-node test cluster:
> >
> >
> > 1.       It seems the schema version has changed on the rebooted node
> after reboot, i.e.
> > Before reboot,
> > Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326
> > MigrationManager.java (line 328) Gossiping my schema version
> > 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> > Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122
> > MigrationManager.java (line 328) Gossiping my schema version
> > 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> >
> > After rebooting node 2,
> > Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java
> > (line 328) Gossiping my schema version
> > f5270873-ba1f-39c7-ab2e-a86db868b09b
> >
> >
> >
> > 2.       After reboot, both nods repeatedly send MigrationTask to each
> other - we suspect it is related to the schema version (Digest) mismatch
> after Node 2 rebooted:
> > The node2  keeps submitting the migration task over 100+ times to the
> other node.
> > INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011)
> > Node /192.168.88.33 has restarted, now UP INFO [GossipStage:1]
> > 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating
> > topology for /192.168.88.33 INFO [GossipStage:1] 2016-04-19
> > 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state
> > jump to normal INFO [GossipStage:1] 2016-04-19 11:18:18,264
> > TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line
> 102) Submitting migration task for /192.168.88.33 DEBUG [GossipStage:1]
> 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting
> migration task for /192.168.88.33 DEBUG [MigrationStage:1] 2016-04-19
> 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request:
> node /192.168.88.33 is down.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java
> (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> > DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java
> > (line 977) removing expire time for endpoint : /192.168.88.33 INFO
> > [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line
> > 978) InetAddress /192.168.88.33 is now UP DEBUG
> > [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java
> > (line 102) Submitting migration task for /192.168.88.33 DEBUG
> > [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line
> > 977) removing expire time for endpoint : /192.168.88.33 INFO
> > [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line
> > 978) InetAddress /192.168.88.33 is now UP DEBUG
> [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java
> (line 102) Submitting migration task for /192.168.88.33 DEBUG
> [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977)
> removing expire time for endpoint : /192.168.88.33 INFO
> [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978)
> InetAddress /192.168.88.33 is now UP DEBUG [RequestResponseStage:2]
> 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting
> migration task for /192.168.88.33 .....
> >
> >
> > On the otherhand, Node 1 keeps updating its gossip information, followed
> by receiving and submitting migrationTask afterwards:
> > DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java
> > (line 977) removing expire time for endpoint : /192.168.88.34 INFO
> > [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line
> > 978) InetAddress /192.168.88.34 is now UP DEBUG
> > [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line
> > 977) removing expire time for endpoint : /192.168.88.34 INFO
> > [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line
> 978) InetAddress /192.168.88.34 is now UP DEBUG [RequestResponseStage:3]
> 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for
> endpoint : /192.168.88.34 INFO [RequestResponseStage:3] 2016-04-19
> 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now
> UP ......
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496
> MigrationRequestVerbHandler.java (line 41) Received migration request from /
> 192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595
> MigrationRequestVerbHandler.java (line 41) Received migration request from /
> 192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843
> MigrationRequestVerbHandler.java (line 41) Received migration request from /
> 192.168.88.34.
> > DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878
> MigrationRequestVerbHandler.java (line 41) Received migration request from /
> 192.168.88.34.
> > ......
> > DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java
> > (line 127) submitting migration task for /192.168.88.34 DEBUG
> > [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line
> > 127) submitting migration task for /192.168.88.34 DEBUG
> [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127)
> submitting migration task for /192.168.88.34 .....
> >
> > Has anyone experienced this scenario? Thanks in advanced!
> >
> > Sincerely,
> >
> > Michael Fong
> >
> > From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> > Sent: Wednesday, April 20, 2016 10:43 AM
> > To: user@cassandra.apache.org; dev@cassandra.apache.org
> > Subject: Cassandra 2.0.x OOM during bootstrap
> >
> > Hi, all,
> >
> > We have recently encountered a Cassandra OOM issue when Cassandra is
> brought up sometimes (but not always) in our 4-node cluster test bed.
> >
> > After analyzing the heap dump, we could find the Internal-Response
> thread pool (JMXEnabledThreadPoolExecutor) is filled with thounds of
> 'org.apache.cassandra.net.MessageIn' objects, and occupy > 2 gigabytes of
> heap memory.
> >
> > According to the documents on internet, it seems internal-response
> thread pool is somewhat related to schema-checking. Has anyone encountered
> similar issue before?
> >
> > We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> >
> > Sincerely,
> >
> > Michael Fong
>

RE: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Posted by Michael Fong <mi...@ruckuswireless.com>.
Hi,

Thanks for your recommendation. 
I also opened a ticket to keep track @ https://issues.apache.org/jira/browse/CASSANDRA-11748
Hope this brings it to someone's attention. Thanks.

Sincerely,

Michael Fong

-----Original Message-----
From: Michael Kjellman [mailto:mkjellman@internalcircle.com] 
Sent: Monday, May 09, 2016 11:57 AM
To: dev@cassandra.apache.org
Cc: user@cassandra.apache.org
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

I'd recommend you create a JIRA! That way you can get some traction on the issue. Obviously an OOM is never correct, even if your process is wrong in some way!

Best,
kjellman 

Sent from my iPhone

> On May 8, 2016, at 8:48 PM, Michael Fong <mi...@ruckuswireless.com> wrote:
> 
> Hi, all,
> 
> 
> Haven't heard any responses so far, and this isue has troubled us for quite some time. Here is another update:
> 
> We have noticed several times that The schema version may change after migration and reboot:
> 
> Here is the scenario:
> 
> 1.       Two node cluster (1 & 2).
> 
> 2.       There are some schema changes, i.e. create a few new columnfamily. The cluster will wait until both nodes have schema version in sync (describe cluster) before moving on.
> 
> 3.       Right before node2 is rebooted, the schema version is consistent; however, after ndoe2 reboots and starts servicing, the MigrationManager would gossip different schema version.
> 
> 4.       Afterwards, both nodes starts exchanging schema  message indefinitely until one of the node dies.
> 
> We currently suspect the change of schema is due to replying the old entry in commit log. We wish to continue dig further, but need experts help on this.
> 
> I don't know if anyone has seen this before, or if there is anything wrong with our migration flow though..
> 
> Thanks in advance.
> 
> Best regards,
> 
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Thursday, April 21, 2016 6:41 PM
> To: user@cassandra.apache.org; dev@cassandra.apache.org
> Subject: RE: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> Here is some more information on before the OOM happened on the rebooted node in a 2-node test cluster:
> 
> 
> 1.       It seems the schema version has changed on the rebooted node after reboot, i.e.
> Before reboot,
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> 
> After rebooting node 2,
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java 
> (line 328) Gossiping my schema version 
> f5270873-ba1f-39c7-ab2e-a86db868b09b
> 
> 
> 
> 2.       After reboot, both nods repeatedly send MigrationTask to each other - we suspect it is related to the schema version (Digest) mismatch after Node 2 rebooted:
> The node2  keeps submitting the migration task over 100+ times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) 
> Node /192.168.88.33 has restarted, now UP INFO [GossipStage:1] 
> 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating 
> topology for /192.168.88.33 INFO [GossipStage:1] 2016-04-19 
> 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state 
> jump to normal INFO [GossipStage:1] 2016-04-19 11:18:18,264 
> TokenMetadata.java (line 414) Updating topology for /192.168.88.33 DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33 DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33 DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java 
> (line 977) removing expire time for endpoint : /192.168.88.33 INFO 
> [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 
> 978) InetAddress /192.168.88.33 is now UP DEBUG 
> [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java 
> (line 102) Submitting migration task for /192.168.88.33 DEBUG 
> [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 
> 977) removing expire time for endpoint : /192.168.88.33 INFO 
> [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 
> 978) InetAddress /192.168.88.33 is now UP DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33 DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33 INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33 .....
> 
> 
> On the otherhand, Node 1 keeps updating its gossip information, followed by receiving and submitting migrationTask afterwards:
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java 
> (line 977) removing expire time for endpoint : /192.168.88.34 INFO 
> [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP DEBUG 
> [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 
> 977) removing expire time for endpoint : /192.168.88.34 INFO 
> [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34 INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP ......
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> ......
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java 
> (line 127) submitting migration task for /192.168.88.34 DEBUG 
> [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34 DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34 .....
> 
> Has anyone experienced this scenario? Thanks in advanced!
> 
> Sincerely,
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Wednesday, April 20, 2016 10:43 AM
> To: user@cassandra.apache.org<ma...@cassandra.apache.org>; 
> dev@cassandra.apache.org<ma...@cassandra.apache.org>
> Subject: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.
> 
> After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.
> 
> According to documentation found online, the internal-response thread pool appears to be related to schema checking. Has anyone encountered a similar issue before?
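> 
> As a rough illustration of the failure mode we suspect (a hand-written sketch, not Cassandra source; it only assumes the internal-response stage is a small fixed pool fed from an unbounded queue):
> 
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
> 
> public class InternalResponseBacklogSketch {
>     public static void main(String[] args) {
>         // Stand-in for the internal-response stage: one worker, unbounded backlog.
>         LinkedBlockingQueue<Runnable> backlog = new LinkedBlockingQueue<>();
>         ThreadPoolExecutor stage =
>                 new ThreadPoolExecutor(1, 1, 60, TimeUnit.SECONDS, backlog);
> 
>         // Every gossip round that still sees a schema version mismatch submits
>         // another response to process; each one pins its payload (the MessageIn
>         // objects in our heap dump) until the worker gets to it.
>         for (int i = 0; i < 100_000; i++) {
>             final byte[] payload = new byte[1024];
>             stage.submit(() -> {
>                 Thread.sleep(5); // pretend merging is slower than arrivals
>                 return payload.length;
>             });
>         }
>         System.out.println("responses still queued: " + backlog.size());
>         stage.shutdownNow();
>     }
> }
> 
> If responses arrive faster than they are drained like this, the queue and the payloads it retains grow without bound, which matches what we see in the heap dump.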
> 
> We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> 
> Sincerely,
> 
> Michael Fong

RE: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Posted by Michael Fong <mi...@ruckuswireless.com>.
Hi,

Thanks for your recommendation. 
I also opened a ticket to keep track @ https://issues.apache.org/jira/browse/CASSANDRA-11748
Hopefully this brings it to someone's attention. Thanks.

Sincerely,

Michael Fong

-----Original Message-----
From: Michael Kjellman [mailto:mkjellman@internalcircle.com] 
Sent: Monday, May 09, 2016 11:57 AM
To: dev@cassandra.apache.org
Cc: user@cassandra.apache.org
Subject: Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

I'd recommend you create a JIRA! That way you can get some traction on the issue. Obviously an OOM is never correct, even if your process is wrong in some way!

Best,
kjellman 

Sent from my iPhone

> On May 8, 2016, at 8:48 PM, Michael Fong <mi...@ruckuswireless.com> wrote:
> 
> Hi, all,
> 
> 
> Haven't heard any responses so far, and this issue has troubled us for quite some time. Here is another update:
> 
> We have noticed several times that the schema version may change after migration and reboot:
> 
> Here is the scenario:
> 
> 1.       Two node cluster (1 & 2).
> 
> 2.       There are some schema changes, e.g. creating a few new column families. The cluster waits until both nodes have their schema versions in sync (checked via describe cluster) before moving on.
> 
> 3.       Right before node2 is rebooted, the schema version is consistent; however, after node2 reboots and starts servicing requests, MigrationManager gossips a different schema version.
> 
> 4.       Afterwards, both nodes start exchanging schema messages indefinitely until one of the nodes dies.
> 
> We currently suspect the schema change is caused by replaying old entries from the commit log. We wish to dig further, but need experts' help on this.
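> 
> For what it's worth, our mental model is that the schema version is essentially a digest over the serialized schema rows (including their cell timestamps), so anything that commit-log replay changes in those rows can change the UUID even when the logical schema looks identical. A minimal sketch of that idea (illustration only; the names are ours, not Cassandra's):
> 
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import java.util.List;
> import java.util.UUID;
> 
> public class SchemaVersionSketch {
>     // serializedSchemaRows stands in for the rows of the schema system tables,
>     // serialized together with their timestamps.
>     static UUID versionOf(List<String> serializedSchemaRows) throws Exception {
>         MessageDigest md5 = MessageDigest.getInstance("MD5");
>         for (String row : serializedSchemaRows) {
>             // If replay resurrects an older mutation, or rewrites the same
>             // definition with a different timestamp, these bytes change...
>             md5.update(row.getBytes(StandardCharsets.UTF_8));
>         }
>         // ...and so does the version UUID that ends up being gossiped.
>         return UUID.nameUUIDFromBytes(md5.digest());
>     }
> }
> 
> That would be consistent with node 2 gossiping a brand-new version right after replaying its commit log, even though describe cluster showed agreement before the reboot.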
> 
> I don't know if anyone has seen this before, or whether there is anything wrong with our migration flow, though.
> 
> Thanks in advance.
> 
> Best regards,
> 
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Thursday, April 21, 2016 6:41 PM
> To: user@cassandra.apache.org; dev@cassandra.apache.org
> Subject: RE: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> Here is some more information on what happened before the OOM on the rebooted node in a 2-node test cluster:
> 
> 
> 1.       It seems the schema version changed on the rebooted node, i.e.
> Before reboot,
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> 
> After rebooting node 2,
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> 
> 
> 
> 2.       After reboot, both nodes repeatedly send MigrationTasks to each other - we suspect this is related to the schema version (Digest) mismatch after Node 2 rebooted:
> Node 2 keeps submitting the migration task 100+ times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
> INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> .....
> 
> 
> On the other hand, Node 1 keeps updating its gossip information, and then repeatedly receives and submits migration tasks:
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> ......
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> ......
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> .....
> 
> Has anyone experienced this scenario? Thanks in advance!
> 
> Sincerely,
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Wednesday, April 20, 2016 10:43 AM
> To: user@cassandra.apache.org<ma...@cassandra.apache.org>; 
> dev@cassandra.apache.org<ma...@cassandra.apache.org>
> Subject: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.
> 
> After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.
> 
> According to documentation found online, the internal-response thread pool appears to be related to schema checking. Has anyone encountered a similar issue before?
> 
> We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> 
> Sincerely,
> 
> Michael Fong

Re: Cassandra 2.0.x OOM during startup - schema version inconsistency after reboot

Posted by Michael Kjellman <mk...@internalcircle.com>.
I'd recommend you create a JIRA! That way you can get some traction on the issue. Obviously an OOM is never correct, even if your process is wrong in some way!

Best,
kjellman 

Sent from my iPhone

> On May 8, 2016, at 8:48 PM, Michael Fong <mi...@ruckuswireless.com> wrote:
> 
> Hi, all,
> 
> 
> Haven't heard any responses so far, and this issue has troubled us for quite some time. Here is another update:
> 
> We have noticed several times that the schema version may change after migration and reboot:
> 
> Here is the scenario:
> 
> 1.       Two node cluster (1 & 2).
> 
> 2.       There are some schema changes, e.g. creating a few new column families. The cluster waits until both nodes have their schema versions in sync (checked via describe cluster) before moving on; one way to run that check from a client is sketched right after this list.
> 
> 3.       Right before node2 is rebooted, the schema version is consistent; however, after node2 reboots and starts servicing requests, MigrationManager gossips a different schema version.
> 
> 4.       Afterwards, both nodes start exchanging schema messages indefinitely until one of the nodes dies.
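> 
> The schema-agreement check mentioned in step 2, sketched with the DataStax Java driver against the system tables (illustrative only, not our exact tooling; the contact point is just one of our test nodes):
> 
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import java.util.HashSet;
> import java.util.Set;
> import java.util.UUID;
> 
> public class SchemaAgreementCheck {
>     public static void main(String[] args) {
>         Cluster cluster = Cluster.builder().addContactPoint("192.168.88.33").build();
>         try {
>             Session session = cluster.connect();
>             Set<UUID> versions = new HashSet<>();
>             // The local node's view of its own schema version.
>             Row local = session.execute("SELECT schema_version FROM system.local").one();
>             versions.add(local.getUUID("schema_version"));
>             // The versions the local node has learned for its peers via gossip.
>             for (Row peer : session.execute("SELECT peer, schema_version FROM system.peers")) {
>                 versions.add(peer.getUUID("schema_version"));
>             }
>             // A single distinct version means the cluster agrees; more than one
>             // means migrations are still (or again) in flight.
>             System.out.println(versions.size() == 1
>                     ? "schema in agreement: " + versions
>                     : "schema MISMATCH: " + versions);
>         } finally {
>             cluster.close();
>         }
>     }
> }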
> 
> We currently suspect the schema change is caused by replaying old entries from the commit log. We wish to dig further, but need experts' help on this.
> 
> I don't know if anyone has seen this before, or whether there is anything wrong with our migration flow, though.
> 
> Thanks in advance.
> 
> Best regards,
> 
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Thursday, April 21, 2016 6:41 PM
> To: user@cassandra.apache.org; dev@cassandra.apache.org
> Subject: RE: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> Here is some more information on what happened before the OOM on the rebooted node in a 2-node test cluster:
> 
> 
> 1.       It seems the schema version changed on the rebooted node, i.e.
> Before reboot,
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> 
> After rebooting node 2,
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> 
> 
> 
> 2.       After reboot, both nodes repeatedly send MigrationTasks to each other - we suspect this is related to the schema version (Digest) mismatch after Node 2 rebooted:
> Node 2 keeps submitting the migration task 100+ times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
> INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
> INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
> DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
> .....
> 
> 
> On the other hand, Node 1 keeps updating its gossip information, and then repeatedly receives and submits migration tasks:
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
> ......
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
> ......
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
> .....
> 
> Has anyone experienced this scenario? Thanks in advance!
> 
> Sincerely,
> 
> Michael Fong
> 
> From: Michael Fong [mailto:michael.fong@ruckuswireless.com]
> Sent: Wednesday, April 20, 2016 10:43 AM
> To: user@cassandra.apache.org<ma...@cassandra.apache.org>; dev@cassandra.apache.org<ma...@cassandra.apache.org>
> Subject: Cassandra 2.0.x OOM during bootstrap
> 
> Hi, all,
> 
> We have recently encountered a Cassandra OOM issue that occurs sometimes (but not always) when Cassandra is brought up in our 4-node cluster test bed.
> 
> After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory.
> 
> According to documentation found online, the internal-response thread pool appears to be related to schema checking. Has anyone encountered a similar issue before?
> 
> We are using Cassandra 2.0.17 and JDK 1.8. Thanks in advance!
> 
> Sincerely,
> 
> Michael Fong