Posted to commits@cassandra.apache.org by "Paul Querna (JIRA)" <ji...@apache.org> on 2011/02/19 22:08:38 UTC
[jira] Created: (CASSANDRA-2201) Gossip synchronization issues
Gossip synchronization issues
-----------------------------
Key: CASSANDRA-2201
URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
Project: Cassandra
Issue Type: Bug
Affects Versions: 0.6.12
Environment: r1071793 (0.6.12)
Ubuntu 9.10
24 node cluster.
JNA enabled.
java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
Reporter: Paul Querna
After upgrading to 0.6.12ish, we noticed that sometimes whole rows were being reported as missing from queries.
It seemed random, and at first we thought there might be a wider problem in 0.6.12 -- but we found that one node of the 24 had an incorrect gossip view of the ring.
Correct nodetool ring output:
{code}
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
10.177.192.88 Up 219.28 GB 4.17% 7089215977519551322153637654828504405 | ^
172.21.2.169 Up 225.93 GB 4.17% 14178431955039102644307275309657008810 v |
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 | ^
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 v |
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 | ^
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 v |
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 | ^
172.21.2.223 Up 229.38 GB 4.17% 56713727820156410577229101238628035240 v |
10.177.192.225 Up 193.1 GB 4.17% 63802943797675961899382738893456539645 | ^
172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050 v |
10.177.192.226 Up 194.33 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.38 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
172.21.2.224 Up 205.9 GB 4.17% 99249023685273718510150927167599061670 v |
10.177.192.228 Up 191.82 GB 4.17% 106338239662793269832304564822427566075 | ^
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 v |
10.177.192.229 Up 193.2 GB 4.17% 120516671617832372476611840132084574885 | ^
172.21.0.26 Up 226.12 GB 4.17% 127605887595351923798765477786913079290 v |
10.177.192.230 Up 187.28 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231 Up 188.05 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232 Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
On the node that had a different nodetool ring output:
{code}
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
172.21.2.169 Up 225.93 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 | ^
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 v |
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 | ^
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 v |
172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050 | ^
172.21.1.32 Up 230.38 GB 8.33% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228 Up 191.82 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 | ^
10.177.192.229 Up 193.2 GB 4.17% 120516671617832372476611840132084574885 v |
172.21.0.26 Up 226 GB 4.17% 127605887595351923798765477786913079290 | ^
10.177.192.230 Up 187.28 GB 4.17% 134695103572871475120919115441741583695 v |
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 | ^
10.177.192.231 Up 188.05 GB 4.17% 148873535527910577765226390751398592505 v |
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 | ^
10.177.192.232 Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
As you can see, 10.177.192.226 was missing from the ring.
On cass11, everything else looked fine, including no pending or active tasks in tpstats.
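The Owns column is the quickest tell here: in a balanced 24-node ring each node owns 1/24 of the token space, about 4.17%, so an 8.33% or 12.50% entry means a node is also covering the ranges of one or two neighbours it no longer sees. A minimal sketch of that arithmetic, using tokens from the output above (the RingOwnership helper is illustrative, not Cassandra code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class RingOwnership {
    // RandomPartitioner token space: [0, 2^127)
    static final BigDecimal TOKEN_SPACE = new BigDecimal(BigInteger.ONE.shiftLeft(127));

    // Percentage of the ring owned by a node with token `token`
    // whose predecessor holds token `previous`.
    static double owns(BigInteger previous, BigInteger token) {
        BigDecimal range = new BigDecimal(token.subtract(previous));
        return range.divide(TOKEN_SPACE, 6, RoundingMode.HALF_UP).doubleValue() * 100;
    }

    public static void main(String[] args) {
        // Three consecutive tokens from the healthy ring output above.
        BigInteger t115 = new BigInteger("70892159775195513221536376548285044050"); // 172.21.3.115
        BigInteger t226 = new BigInteger("77981375752715064543690014203113548455"); // 10.177.192.226
        BigInteger t32  = new BigInteger("85070591730234615865843651857942052860"); // 172.21.1.32

        System.out.printf("healthy slot: %.2f%%%n", owns(t115, t226)); // ~4.17
        // With 10.177.192.226 missing, 172.21.1.32 covers two slots:
        System.out.printf("doubled slot: %.2f%%%n", owns(t115, t32));  // ~8.33
    }
}
```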
However, we did notice an exception on startup in the logs on cass11:
{code}
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749     at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
{code}
driftx said that it should be harmless, but it's the only thing I see that is different about that node.
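For context on that stack trace: DataInputStream.readInt() throws EOFException whenever the stream ends before four bytes arrive, so a peer that connects and drops the socket before sending a complete message header triggers exactly this path in IncomingTcpConnection. A minimal demonstration with the plain JDK (no Cassandra code involved):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedRead {
    public static void main(String[] args) throws IOException {
        // Simulate a peer that closed the connection after sending
        // only two bytes of a four-byte message-size header.
        byte[] truncated = {0x00, 0x01};
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated))) {
            in.readInt(); // needs 4 bytes, only 2 available
            System.out.println("read a full int (unexpected)");
        } catch (EOFException e) {
            // This is the exception wrapped in java.io.IOError in the log above.
            System.out.println("EOFException: stream ended mid-header");
        }
    }
}
```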
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997966#comment-12997966 ]
Gary Dusbabek commented on CASSANDRA-2201:
------------------------------------------
With Brandon's comment in mind, a debug log taken from the node after issuing `nodetool ring` would be helpful. It lists the tokens the node knows about (in StorageService). I'm pretty sure this is happening because StorageService.tokenMetadata_ has incomplete information in it.
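One way to see what tokenMetadata_-style state a node is missing is to diff the token-to-endpoint map it reports against a healthy node's map. A hypothetical sketch with plain Java maps (RingDiff and missingEntries are illustrative names, not Cassandra's TokenMetadata API):

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class RingDiff {
    // Given the token->endpoint map from a healthy node and the map a
    // suspect node reports, return the entries the suspect is missing.
    static Map<BigInteger, String> missingEntries(Map<BigInteger, String> healthy,
                                                  Map<BigInteger, String> suspect) {
        Map<BigInteger, String> missing = new TreeMap<>(healthy);
        missing.keySet().removeAll(suspect.keySet());
        return missing;
    }

    public static void main(String[] args) {
        Map<BigInteger, String> healthy = new TreeMap<>();
        healthy.put(new BigInteger("70892159775195513221536376548285044050"), "172.21.3.115");
        healthy.put(new BigInteger("77981375752715064543690014203113548455"), "10.177.192.226");
        healthy.put(new BigInteger("85070591730234615865843651857942052860"), "172.21.1.32");

        Map<BigInteger, String> suspect = new TreeMap<>(healthy);
        suspect.remove(new BigInteger("77981375752715064543690014203113548455"));

        // Prints the one entry cass11 was missing: 10.177.192.226
        System.out.println(missingEntries(healthy, suspect));
    }
}
```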
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
[jira] Assigned: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams reassigned CASSANDRA-2201:
-------------------------------------------
Assignee: Brandon Williams
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Attachment: CASSANDRA-2201-startup-trace.txt
Log from a node with a bad ring, captured with TRACE logging enabled on log4j.logger.org.apache.cassandra.gms.
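For anyone reproducing this, that logger can be raised to TRACE in log4j.properties; a sketch, assuming the stock 0.6-era conf/log4j.properties layout:

```properties
# Raise gossip logging to TRACE without changing the root level.
log4j.logger.org.apache.cassandra.gms=TRACE
```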
After this startup, the node saw the ring as follows:
{code}
$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.19 GB 4.17% 0 |<--|
172.21.2.169 Up 226.34 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 226.05 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 227.23 GB 4.17% 28356863910078205288614550619314017620 | ^
172.21.2.173 Up 227.64 GB 8.33% 42535295865117307932921825928971026430 v |
172.21.2.223 Up 229.14 GB 8.33% 56713727820156410577229101238628035240 | ^
172.21.3.115 Up 231.46 GB 8.33% 70892159775195513221536376548285044050 v |
10.177.192.226 Up 194.35 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.77 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.68 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228 Up 192.05 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.9 GB 4.17% 113427455640312821154458202477256070480 | ^
172.21.0.26 Up 226.33 GB 8.33% 127605887595351923798765477786913079290 v |
10.177.192.230 Up 187.47 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.63 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231 Up 188.33 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 216.22 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232 Up 217.72 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
[jira] [Resolved] (CASSANDRA-2201) Gossip synchronization issues
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-2201.
---------------------------------------
Resolution: Cannot Reproduce
Assignee: (was: Brandon Williams)
[jira] Issue Comment Edited: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996863#comment-12996863 ]
Paul Querna edited comment on CASSANDRA-2201 at 2/19/11 9:33 PM:
-----------------------------------------------------------------
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
-Restarting them each seems to have fixed this issue.-
Restarting sometimes fixed the issue.
was (Author: pquerna):
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
Restarting them each seems to have fixed this issue.
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Attachment: CASSANDRA-2201-jstack.txt
jstack of one of the other nodes we found in this state.
[jira] Issue Comment Edited: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]
Brandon Williams edited comment on CASSANDRA-2201 at 2/19/11 11:33 PM:
-----------------------------------------------------------------------
Note: this log appears to be a capture of stdout rather than the log file, since there are no class names or line numbers; the file would make things slightly easier.
The one thing this log does indicate is that gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24-node cluster before CASSANDRA-2092, since a node sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct, since a node wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication strategy, where the range-to-endpoint map is constructed, but I don't see anything obvious there. Are you using RAS/PFEPS?
was (Author: brandon.williams):
Note: this log appears to be a capture from stdout and not the file, since there are no class names or line numbers, which would make things slightly easier.
The one thing this log indicates is, gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24 node cluster before CASSANDRA-2092 since it sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct since it wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication where the range to endpoint map is constructed, but I don't see anything obvious there.
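The digest and UP-message counts cited in the comment above can be sanity-checked with plain arithmetic. This is only a restatement of the comment's reasoning, under the stated assumption that before CASSANDRA-2092 a node's gossip syn included its own digest twice:

```java
// Sanity check of the counts cited above (plain arithmetic, not Cassandra code).
public class GossipCounts {
    // Before CASSANDRA-2092, a gossip syn listed every node once plus the
    // sender's own digest a second time.
    static int expectedDigests(int clusterSize) {
        return clusterSize + 1;
    }

    // A node never logs "is now UP" for itself.
    static int expectedUpMessages(int clusterSize) {
        return clusterSize - 1;
    }

    public static void main(String[] args) {
        System.out.println(expectedDigests(24));    // 25, matching the log
        System.out.println(expectedUpMessages(24)); // 23, matching the log
    }
}
```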
> Gossip synchronization issues
> -----------------------------
>
> Key: CASSANDRA-2201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.12
> Environment: r1071793 (0.6.12)
> Ubuntu 9.10
> 24 node cluster.
> JNA enabled.
> java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
> Reporter: Paul Querna
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
>
>
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996863#comment-12996863 ]
Paul Querna commented on CASSANDRA-2201:
----------------------------------------
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
Restarting them each seems to have fixed this issue.
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Description:
After upgrading to 0.6.12ish, we noticed that sometimes whole rows were being reported as missing from queries.
It seemed random, and at first we thought there might be a wider problem in 0.6.12 -- but we found that one node of 24 had incorrect gossip state.
Correct nodetool ring output:
{code}
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
10.177.192.88 Up 219.28 GB 4.17% 7089215977519551322153637654828504405 | ^
172.21.2.169 Up 225.93 GB 4.17% 14178431955039102644307275309657008810 v |
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 | ^
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 v |
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 | ^
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 v |
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 | ^
172.21.2.223 Up 229.38 GB 4.17% 56713727820156410577229101238628035240 v |
10.177.192.225Up 193.1 GB 4.17% 63802943797675961899382738893456539645 | ^
172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050 v |
10.177.192.226Up 194.33 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.38 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
172.21.2.224 Up 205.9 GB 4.17% 99249023685273718510150927167599061670 v |
10.177.192.228Up 191.82 GB 4.17% 106338239662793269832304564822427566075 | ^
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 v |
10.177.192.229Up 193.2 GB 4.17% 120516671617832372476611840132084574885 | ^
172.21.0.26 Up 226.12 GB 4.17% 127605887595351923798765477786913079290 v |
10.177.192.230Up 187.28 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231Up 188.05 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
On the node that had a different nodetool ring output:
{code}
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
172.21.2.169 Up 225.93 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 | ^
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 v |
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 | ^
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 v |
172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050 | ^
172.21.1.32 Up 230.38 GB 8.33% 85070591730234615865843651857942052860 v |
10.177.192.227Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228Up 191.82 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 | ^
10.177.192.229Up 193.2 GB 4.17% 120516671617832372476611840132084574885 v |
172.21.0.26 Up 226 GB 4.17% 127605887595351923798765477786913079290 | ^
10.177.192.230Up 187.28 GB 4.17% 134695103572871475120919115441741583695 v |
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 | ^
10.177.192.231Up 188.05 GB 4.17% 148873535527910577765226390751398592505 v |
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 | ^
10.177.192.232Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
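The Owns column in the output above already quantifies the problem: with 24 evenly spaced tokens, each node owns 1/24 of the ring (4.17%), and when a node is absent from the local view its range is attributed to its successor, so cass11 shows 8.33% entries (one missing predecessor) and a 12.50% entry (two missing predecessors). A quick, hedged sketch of that arithmetic (not nodetool's code):

```java
import java.util.Locale;

// The Owns percentages are just slot counts on a balanced 24-token ring;
// a node missing from the local view donates its slot to the next node
// in token order.
public class OwnsCheck {
    static String owns(int slots, int clusterSize) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * slots / clusterSize);
    }

    public static void main(String[] args) {
        System.out.println(owns(1, 24)); // 4.17%  - all predecessors visible
        System.out.println(owns(2, 24)); // 8.33%  - one predecessor missing
        System.out.println(owns(3, 24)); // 12.50% - two predecessors missing
    }
}
```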
As you can see, 10.177.192.226 was missing from the ring.
On cass11, everything else looked fine, including nothing pending or active in tpstats.
However, we did notice an exception on startup in the cass11 logs:
{code}
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747 at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749 at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750 at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
{code}
driftx said that it should be harmless, but it's the only thing I see that's different about that node.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]
Brandon Williams commented on CASSANDRA-2201:
---------------------------------------------
Note: this log appears to be captured from stdout rather than the log file, since there are no class names or line numbers; those would make things slightly easier.
The one thing this log does indicate is that gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24-node cluster before CASSANDRA-2092, since a node sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct since a node wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication code where the range-to-endpoint map is constructed, but I don't see anything obvious there.
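The digest and 'is now UP' counts above can be sketched as simple arithmetic. A hedged illustration (the class `GossipCount` is mine, not Cassandra's `Gossiper`): pre-CASSANDRA-2092, a node's gossip SYN carries one digest per known endpoint plus its own digest a second time, and 'is now UP' is logged for every peer except the local node.

```java
public class GossipCount {
    // Pre-CASSANDRA-2092: one digest per endpoint, plus the local
    // node's own digest sent a second time.
    static int expectedDigests(int clusterSize) {
        return clusterSize + 1;
    }

    // A node never logs 'is now UP' for itself.
    static int expectedUpMessages(int clusterSize) {
        return clusterSize - 1;
    }

    public static void main(String[] args) {
        System.out.println(expectedDigests(24));    // 25, matching the cass11 log
        System.out.println(expectedUpMessages(24)); // 23, matching the cass11 log
    }
}
```

Both numbers match the captured log for a 24-node cluster, which is why gossip itself looks healthy despite the wrong ring view.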
> Gossip synchronization issues
> -----------------------------
>
> Key: CASSANDRA-2201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.12
> Environment: r1071793 (0.6.12)
> Ubuntu 9.10
> 24 node cluster.
> JNA enabled.
> java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
> Reporter: Paul Querna
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
>