Posted to user@cassandra.apache.org by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com> on 2015/06/12 16:31:16 UTC

connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

Hello,

We recently upgraded a cluster from 2.0.12 to 2.0.15 and now whenever we
stop/kill a cassandra process, some other nodes keep a connection with the
dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

So, if I start the killed node again, it cannot handshake with the nodes
that still hold a connection in the CLOSE_WAIT state until that connection
is closed, so they remain marked down to each other for 5-20 minutes, until
they can handshake again.
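
For reference, one quick way we confirm the lingering sockets on an affected node (a sketch, assuming ss from iproute2 or classic net-tools is installed):

# internode messaging sockets stuck in CLOSE_WAIT on port 7000
ss -tn state close-wait '( sport = :7000 or dport = :7000 )'

# or, with net-tools
netstat -tn | grep ':7000 ' | grep CLOSE_WAIT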

I believe this is somehow related to the fixes CASSANDRA-8336 and
CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will
continue to investigate to see if I find more evidence, but any help at
this point would be appreciated, or at least a confirmation that it could
be related to any of these tickets.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br*
+55 48 3232.3200

Re: Detect Repair is finish

Posted by Jean Tremblay <je...@zen-innovations.com>.
That is very useful. Thank you.




Re: Detect Repair is finish

Posted by Aaron Ploetz <aa...@gmail.com>.
You can do this with two nodetool commands:
nodetool compactionstats will show any active Merkle tree calculations.

nodetool netstats will list any active repair streams.

Aaron Morton posted a Bash script/command a while back that you can just
paste into the (linux) command line to monitor active streams for you:

while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
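
If you would rather poll both signals in one loop, something like this works too (a sketch; the grep patterns assume the usual 2.x output strings and may need tweaking):

while true; do
  date
  # Merkle tree calculations show up as validation compactions
  nodetool -h localhost compactionstats | grep -i validation
  # active repair sessions show up as repair streams
  nodetool -h localhost netstats | grep -i repair
  sleep 10
done

When both greps come back empty for a while, the repair is most likely done.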

Hope this helps,

Aaron Ploetz


On Mon, Jun 22, 2015 at 8:40 AM, Jean Tremblay <
jean.tremblay@zen-innovations.com> wrote:

> Hi,
>
> What is the best way to see if a repair is finished? Is there a JMX object
> or is there a command to see if a repair is finished?
> What happens if by mistake an operator starts a repair before the previous
> one has finished? Will they execute one after the other or at the same
> time?
>
> Thanks for your help.
>
> Jean
>
>

Re: After Restart Nodes had lost data

Posted by Jean Tremblay <je...@zen-innovations.com>.
No, I did not.



On 24 Jun 2015, at 06:05, Jason Wee <pe...@gmail.com> wrote:

on the node 192.168.2.100, did you run repair after its status was UN?


Re: After Restart Nodes had lost data

Posted by Jason Wee <pe...@gmail.com>.
on the node 192.168.2.100, did you run repair after its status was UN?


Re: After Restart Nodes had lost data

Posted by Jean Tremblay <je...@zen-innovations.com>.
Dear Alain,

Thank you for your reply.

Ok, yes, I did not drain. The cluster was loaded with tons of records, and no new records had been added for a few weeks. Each node had a load of about 160 GB as seen in “nodetool status”. I killed the cassandra daemon and restarted it. After cassandra was restarted, I could see in “nodetool status” a load of 5 GB!!

I don’t use counters.
I use RF 3 on 5 nodes. I did not change the replication factor.
I have two types of read queries. One uses QUORUM and the other uses ONE as the consistency level.
I did not change the Topology.

Are you sure this node had data before you restarted it?

Actually the full story is:

- I stopped node0 (192.168.2.100), and I restarted it.
- I stopped node1 (192.168.2.101).
- I ran nodetool status and noticed that node0 was UN and had a load of 5 GB. I found this really weird because all the other ones had about 160 GB. I also saw that node1 was DN with a load of about 160 GB.
- I restarted node1.
- I ran nodetool status and noticed that node1 was UN and also had a load of 5 GB; it previously had a load of about 160 GB. Of that I’m sure.
- Then my program could no longer query C*. Neither the QUORUM nor the ONE consistency level statements could read data.

What does a "nodetool status mykeyspace" outputs ?

I cannot try this anymore. I flushed the whole cluster, and I am currently reloading everything. I was too much in a hurry. I have a demo tomorrow, and I will manage to have it back before then.

After my bad decision of flushing the cluster, I realised that I could have bootstrapped my two nodes again. Learning by doing.

"It’s like the whole cluster is paralysed" --> what does it mean, be more accurate on this please.

You should tell us what actions were taken before this occurred, and what exactly is not working now, since a C* cluster in this state could run perfectly well. No SPOF.

What did I do before? Well, this cluster was basically idling. I was only running lots of selects on it. It was loaded a few weeks ago.
But what I noticed when I restarted node0 is the following:

INFO  [InternalResponseStage:1] 2015-06-23 11:45:32,723 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columnfamilies: 131587 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:32,723 Memtable.java:346 - Writing Memtable-schema_columnfamilies@917967643(34850 serialized bytes, 585 ops, 0%/0% of on/off-heap limit)
>>>> WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:33,982 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-11-Data.db (5274 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
INFO  [GossipStage:1] 2015-06-23 11:45:33,985 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:33,991 Gossiper.java:987 - Node /192.168.2.102 has restarted, now UP
INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,992 Gossiper.java:954 - InetAddress /192.168.2.102 is now UP
INFO  [HANDSHAKE-/192.168.2.102] 2015-06-23 11:45:33,993 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.102
INFO  [GossipStage:1] 2015-06-23 11:45:33,993 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:33,999 Gossiper.java:987 - Node /192.168.2.103 has restarted, now UP
INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,999 Gossiper.java:954 - InetAddress /192.168.2.103 is now UP
INFO  [GossipStage:1] 2015-06-23 11:45:34,001 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
INFO  [HANDSHAKE-/192.168.2.103] 2015-06-23 11:45:34,020 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.103
INFO  [main] 2015-06-23 11:45:34,021 StorageService.java:1642 - Node zennode0/192.168.2.100 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,028 StorageService.java:1642 - Node /192.168.2.104 state jump to normal
INFO  [main] 2015-06-23 11:45:34,038 CassandraDaemon.java:583 - Waiting for gossip to settle before accepting client requests...
INFO  [GossipStage:1] 2015-06-23 11:45:34,039 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,047 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,055 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [CompactionExecutor:1] 2015-06-23 11:45:34,062 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2,].  171,593 bytes to 171,593 (~100% of original) in 1,544ms = 0.105987MB/s.  2 total partitions merged to 2.  Partition merge counts were {1:2, }
INFO  [CompactionExecutor:4] 2015-06-23 11:45:34,063 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2-Data.db')]
INFO  [GossipStage:1] 2015-06-23 11:45:34,066 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:34,140 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db (8246 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
INFO  [CompactionExecutor:2] 2015-06-23 11:45:34,140 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db')]
INFO  [InternalResponseStage:1] 2015-06-23 11:45:34,140 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columns: 284282 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:34,141 Memtable.java:346 - Writing Memtable-schema_columns@813508218(55024 serialized bytes, 1041 ops, 0%/0% of on/off-heap limit)
INFO  [CompactionExecutor:4] 2015-06-23 11:45:37,980 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-3,].  171,593 bytes to 171,593 (~100% of original) in 3,916ms = 0.041789MB/s.  2 total partitions merged to 2.  Partition merge counts were {1:2, }
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:39,265 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db (12966 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=251695)
INFO  [CompactionExecutor:1] 2015-06-23 11:45:39,266 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db')]
INFO  [InternalResponseStage:1] 2015-06-23 11:45:39,271 ColumnFamilyStore.java:882 - Enqueuing flush of schema_triggers: 154 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:39,271 Memtable.java:346 - Writing Memtable-schema_triggers@652166139(8 serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO  [CompactionExecutor:2] 2015-06-23 11:45:39,275 CompactionTask.java:270 - Compacted 4 sstables to [/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-13,].  17,971 bytes to 8,246 (~45% of original) in 5,134ms = 0.001532MB/s.  8 total partitions merged to 3.  Partition merge counts were {2:2, 4:1, }
INFO  [main] 2015-06-23 11:45:42,042 CassandraDaemon.java:615 - No gossip backlog; proceeding


I don’t know what this warning means: WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000

And shortly after the “proceeding” message…

INFO  [HANDSHAKE-zennode0/192.168.2.100] 2015-06-23 11:46:00,566 OutboundTcpConnection.java:485 - Handshaking version with zennode0/192.168.2.100
WARN  [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,595 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.6.jar:2.1.6]
WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 11:46:00,629 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.6.jar:2.1.6]
WARN  [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,633 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74) ~[apache-cassandra-2.1.6.jar:2.1.6]


That seems nastier to me. I don’t know what it means, but I have the impression that the nodes I did not touch are now talking to node0 about a table which the latter does not know about!!! A table node0 “forgot”.
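
In case it helps whoever hits the same thing, I guess one way to check whether the schema really diverged (a sketch; I'm assuming the table-to-cfId mapping lives in system.schema_columnfamilies on 2.1, and that cqlsh is on the path) would be:

# do all nodes report the same schema version?
nodetool describecluster

# which table does the unknown cfId belong to? (run on a healthy node)
cqlsh -e "SELECT keyspace_name, columnfamily_name, cf_id FROM system.schema_columnfamilies;" | grep ddc346b0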


Well, I don’t ask for a miracle here. If I had noticed that there was already a problem after restarting the first node, I would have simply fixed that node before doing anything else… but I did not notice it.

Thanks for your help...




On 23 Jun 2015, at 19:24, Alain RODRIGUEZ <ar...@gmail.com> wrote:

Hi Jean,

"I had to reboot a node. I killed the cassandra process on that node". You should drain the node before killing java (or using any service stop command). This is not what causes the issue yet it will help you to keep consistence if you use counters, and make the reboot faster in any cases.

What is going on highly depends on what you did before.

Did you change the RF?
Did you change the Topology?
Are you sure this node had data before you restarted it?
What does "nodetool status mykeyspace" output?

To make the join faster you could try to bootstrap the node again. I just hope you have RF > 1 (btw, you have one replica down; if you still want reads / writes working, take care to use a low enough consistency level).

"It’s like the whole cluster is paralysed" --> what does it mean, be more accurate on this please.

You should tell us what actions were taken before this occurred, and what exactly is not working now, since a C* cluster in this state could run perfectly well. No SPOF.

C*heers

2015-06-23 16:10 GMT+02:00 Jean Tremblay <je...@zen-innovations.com>:
Does anyone know what to do when such an event occurs?
Does anyone know how this could happen?

I would have tried repairing the node with nodetool repair but that takes much too long… I need my cluster to work for our online system. At this point nothing is working. It’s like the whole cluster is paralysed.
The only solution I see is to remove temporarily that node.

Thanks for your comments.

On 23 Jun 2015, at 12:45, Jean Tremblay <je...@zen-innovations.com> wrote:

Hi,

I have a cluster with 5 nodes running Cassandra 2.1.6.

I had to reboot a node. I killed the cassandra process on that node. Rebooted the machine, and restarted Cassandra.

~/apache-cassandra-DATA/data:321>nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  192.168.2.104  158.27 GB  256     ?       6479205e-6a19-49a8-b1a1-7e788ec29caa  rack1
UN  192.168.2.100  4.75 GB    256     ?       e821da50-23c6-4ea0-b3a1-275ded63bc1f  rack1
UN  192.168.2.101  157.43 GB  256     ?       01525665-bacc-4207-a8c3-eb4fd9532401  rack1
UN  192.168.2.102  159.27 GB  256     ?       596a33d7-5089-4c7e-a9ad-e1f22111b160  rack1
UN  192.168.2.103  167 GB     256     ?       0ce1d48e-57a9-4615-8e12-d7ef3d621c7d  rack1


After restarting node 192.168.2.100 I noticed that its load had diminished to about 2% of what it was. And now when I query the cluster my queries are failing, and that node times out with an error

WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.6.jar:2.1.6]

What is going on? Has anyone experienced something like this?




Re: After Restart Nodes had lost data

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Jean,

"I had to reboot a node. I killed the cassandra process on that node". You
should drain the node before killing java (or using any service stop
command). This is not what causes the issue yet it will help you to keep
consistence if you use counters, and make the reboot faster in any cases.
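
For example (a sketch — the service name and stop command are assumptions,
adjust to how you run C*):

nodetool drain                 # flushes memtables and stops the node accepting writes
sudo service cassandra stop    # or kill the pid once drain has returned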

What is going on highly depends on what you did before.

Did you change the RF?
Did you change the Topology?
Are you sure this node had data before you restarted it?
What does "nodetool status mykeyspace" output?

To make the join faster you could try to bootstrap the node again. I just
hope you have an RF > 1 (btw, you have one replica down; if you still want
reads / writes to work, take care to use a low enough Consistency Level).
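
One way to re-bootstrap it (a sketch under assumptions: RF > 1, the data
still on this node is expendable, the paths are guesses for a default
install, and the node is not in its own seed list — check the
replace_address option against the docs for your version):

sudo service cassandra stop
rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
# in cassandra-env.sh, have the node re-stream its own ranges on startup:
# JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.2.100"
sudo service cassandra start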

"It’s like the whole cluster is paralysed" --> what does it mean, be more
accurate on this please.

You should tell us what actions were taken before this occurred and what
exactly is not working now, since a C* cluster in this state can run
perfectly well. No SPOF.

C*heers

Re: After Restart Nodes had lost data

Posted by Jean Tremblay <je...@zen-innovations.com>.
Does anyone know what to do when such an event occurs?
Does anyone know how this could happen?

I would try repairing the node with nodetool repair, but that takes much too long… I need my cluster to work for our online system. At this point nothing is working. It’s like the whole cluster is paralysed.
The only solution I see is to temporarily remove that node (see the sketch below).
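
(If you do remove it — a sketch; removenode is the safer choice here since
the node's data looks incomplete, and <host-id> is a placeholder for the
Host ID shown by nodetool status:)

nodetool removenode <host-id>    # run from any live node; streams data to restore the replica count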

Thanks for your comments.


After Restart Nodes had lost data

Posted by Jean Tremblay <je...@zen-innovations.com>.
Hi,

I have a cluster with 5 nodes running Cassandra 2.1.6.

I had to reboot a node. I killed the cassandra process on that node. Rebooted the machine, and restarted Cassandra.

~/apache-cassandra-DATA/data:321>nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  192.168.2.104  158.27 GB  256     ?       6479205e-6a19-49a8-b1a1-7e788ec29caa  rack1
UN  192.168.2.100  4.75 GB    256     ?       e821da50-23c6-4ea0-b3a1-275ded63bc1f  rack1
UN  192.168.2.101  157.43 GB  256     ?       01525665-bacc-4207-a8c3-eb4fd9532401  rack1
UN  192.168.2.102  159.27 GB  256     ?       596a33d7-5089-4c7e-a9ad-e1f22111b160  rack1
UN  192.168.2.103  167 GB     256     ?       0ce1d48e-57a9-4615-8e12-d7ef3d621c7d  rack1


After restarting node 192.168.2.100 I noticed that its load had diminished to about 2% of what it was. And now when I query the cluster my queries are failing, and that node times out with an error

WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.6.jar:2.1.6]

What is going on? Has anyone experienced something like this?
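
(The UnknownColumnFamilyException means the receiving node cannot map that
cfId to any table it knows, which usually points at schema disagreement. A
quick check — a sketch; the system table layout assumed here is the 2.1 one,
so verify the column names on your version:)

nodetool describecluster    # every node should report the same schema version
cqlsh -e "SELECT keyspace_name, columnfamily_name, cf_id FROM system.schema_columnfamilies;"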

Re: Datastax Java Driver vs Cassandra 2.1.7

Posted by Jean Tremblay <je...@zen-innovations.com>.
I agree. Thanks a lot.

Re: Datastax Java Driver vs Cassandra 2.1.7

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Although amending the query is a workaround for this (and duplicating the
columns in the selection is not something I imagine one would deliberately
do), this is still an ugly regression, so I've opened
https://issues.apache.org/jira/browse/CASSANDRA-9636 to fix it.

Thanks,
Sam


Re: Datastax Java Driver vs Cassandra 2.1.7

Posted by Jean Tremblay <je...@zen-innovations.com>.
Hi Sam,

You have a really good gut feeling.
I went back to the query I had been using for many months… which was working… but obviously there was something wrong with it.
The problem was *simply* that I listed the same field twice in the select. I corrected my code and now I don’t get the error with 2.1.7.

This provoked the error on the nodes:
ERROR [SharedPool-Worker-1] 2015-06-23 10:56:01,186 Message.java:538 - Unexpected exception during request; channel = [id: 0x5e809aa1, /192.168.2.8:49581 => /192.168.2.201:9042]
java.lang.AssertionError: null
at org.apache.cassandra.cql3.ResultSet.addRow(ResultSet.java:63) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.Selection$ResultSetBuilder.newRow(Selection.java:347) ~[apache-cassandra-2.1.7.jar:2.1.7]

I can also reproduce the error on cqlsh:

cqlsh> select c1, p1, yyyymm, c2, iq, iq from ds.t1 where type='D' and c1=1 and yyyymm>=201401 and yyyymm<=201402 and p1='01';
ServerError: <ErrorMessage code=0000 [Server error] message="java.lang.AssertionError">
cqlsh> select c1, p1, yyyymm, c2, iq  from ds.t1 where type='D' and c1=1 and yyyymm>=201401 and yyyymm<=201402 and p1='01';

 c1     | p1    | yyyymm | c2     | iq
--------+-------+--------+--------+-----------------
      1 |    01 | 201401 |      1 |   {'XX': 97160}
…

Conclusion… my mistake. Sorry.



Re: Datastax Java Driver vs Cassandra 2.1.7

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Can you share the query that you're executing when you see the error and
the schema of the target table? It could be something related to
CASSANDRA-9532.


Datastax Java Driver vs Cassandra 2.1.7

Posted by Jean Tremblay <je...@zen-innovations.com>.
Hi,

I’m using Datastax Java Driver V 2.1.6
I migrated my cluster to Cassandra V2.1.7
And now I get an error on my client like this:

2015-06-23 10:49:11.914  WARN 20955 --- [ I/O worker #14] com.datastax.driver.core.RequestHandler  : /192.168.2.201:9042 replied with server error (java.lang.AssertionError), trying next host.

And on the node I see an AssertionError

ERROR [SharedPool-Worker-1] 2015-06-23 10:56:01,186 Message.java:538 - Unexpected exception during request; channel = [id: 0x5e809aa1, /192.168.2.8:49581 => /192.168.2.201:9042]
java.lang.AssertionError: null
at org.apache.cassandra.cql3.ResultSet.addRow(ResultSet.java:63) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.Selection$ResultSetBuilder.newRow(Selection.java:347) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.SelectStatement.processColumnFamily(SelectStatement.java:1289) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:1223) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:299) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:238) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:67) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:238) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:493) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:134) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335) [apache-cassandra-2.1.7.jar:2.1.7]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.7.jar:2.1.7]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]

Is there a known problem with Cassandra 2.1.7?

Thanks for your comments.

Jean

Re: Detect Repair is finish

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Jun 22, 2015 at 6:40 AM, Jean Tremblay <
jean.tremblay@zen-innovations.com> wrote:

> What is the best way to see if a repair is finished? Is there a JMX object
> or a command to see if a repair is finished?
> What happens if by mistake an operator starts a repair before the previous
> one has finished? Will they execute one after the other or at the
> same time?
>

Circa (unreleased) 2.2.0:
https://issues.apache.org/jira/browse/CASSANDRA-5483

=Rob

Detect Repair is finish

Posted by Jean Tremblay <je...@zen-innovations.com>.
Hi,

What is the best way to see if a repair is finished? Is there a JMX object or a command to see if a repair is finished?
What happens if by mistake an operator starts a repair before the previous one has finished? Will they execute one after the other or at the same time?

Thanks for your help.

Jean


Re: connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

Posted by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com>.
For the record: https://issues.apache.org/jira/browse/CASSANDRA-9630




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200

Re: connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

Posted by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com>.
Just a quick update, I was able to fix the problem by reverting the patch
CASSANDRA-8336 in our custom cassandra build. I don't know the root cause
yet though. I will open a JIRA ticket and post here for reference later.
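
(To watch whether peers are still holding the lingering connections after a
node is killed — a sketch; netstat flags vary by platform:)

watch -n 5 'netstat -tan | grep ":7000" | grep CLOSE_WAIT'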




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200