Posted to commits@cassandra.apache.org by "Paul Querna (JIRA)" <ji...@apache.org> on 2011/02/19 22:08:38 UTC
[jira] Created: (CASSANDRA-2201) Gossip synchronization issues
Gossip synchronization issues
-----------------------------
Key: CASSANDRA-2201
URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
Project: Cassandra
Issue Type: Bug
Affects Versions: 0.6.12
Environment: r1071793 (0.6.12)
Ubuntu 9.10
24 node cluster.
JNA enabled.
java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
Reporter: Paul Querna
After upgrading to 0.6.12ish, we noticed that sometimes whole rows were being reported as missing from queries.
It seemed random, and at first we thought there might be a wider problem in 0.6.12 -- but we found that one node of the 24 had an incorrect gossip view of the ring.
Correct nodetool ring output:
{code}
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
10.177.192.88 Up 219.28 GB 4.17% 7089215977519551322153637654828504405 | ^
172.21.2.169 Up 225.93 GB 4.17% 14178431955039102644307275309657008810 v |
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 | ^
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 v |
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 | ^
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 v |
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 | ^
172.21.2.223 Up 229.38 GB 4.17% 56713727820156410577229101238628035240 v |
10.177.192.225 Up 193.1 GB 4.17% 63802943797675961899382738893456539645 | ^
172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050 v |
10.177.192.226 Up 194.33 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.38 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
172.21.2.224 Up 205.9 GB 4.17% 99249023685273718510150927167599061670 v |
10.177.192.228 Up 191.82 GB 4.17% 106338239662793269832304564822427566075 | ^
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 v |
10.177.192.229 Up 193.2 GB 4.17% 120516671617832372476611840132084574885 | ^
172.21.0.26 Up 226.12 GB 4.17% 127605887595351923798765477786913079290 v |
10.177.192.230 Up 187.28 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231 Up 188.05 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232 Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
On the node that had a different nodetool ring output:
{code}
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
172.21.2.169 Up 225.93 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 | ^
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 v |
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 | ^
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 v |
172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050 | ^
172.21.1.32 Up 230.38 GB 8.33% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228 Up 191.82 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 | ^
10.177.192.229 Up 193.2 GB 4.17% 120516671617832372476611840132084574885 v |
172.21.0.26 Up 226 GB 4.17% 127605887595351923798765477786913079290 | ^
10.177.192.230 Up 187.28 GB 4.17% 134695103572871475120919115441741583695 v |
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 | ^
10.177.192.231 Up 188.05 GB 4.17% 148873535527910577765226390751398592505 v |
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 | ^
10.177.192.232 Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
As you can see, 10.177.192.226 was missing from the ring.
On cass11, everything else looked fine, including no pending or active tasks in tpstats.
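The Owns column is the quickest tell here: in a balanced 24-node ring each node owns 1/24 of the token space, about 4.17%, so an 8.33% or 12.50% entry means a node is also covering the ranges of one or two neighbours it no longer sees. A minimal sketch of that arithmetic, using tokens from the output above (the RingOwnership helper is illustrative, not Cassandra code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class RingOwnership {
    // RandomPartitioner token space: [0, 2^127)
    static final BigDecimal TOKEN_SPACE = new BigDecimal(BigInteger.ONE.shiftLeft(127));

    // Percentage of the ring owned by a node with token `token`
    // whose predecessor holds token `previous`.
    static double owns(BigInteger previous, BigInteger token) {
        BigDecimal range = new BigDecimal(token.subtract(previous));
        return range.divide(TOKEN_SPACE, 6, RoundingMode.HALF_UP).doubleValue() * 100;
    }

    public static void main(String[] args) {
        // Three consecutive tokens from the healthy ring output above.
        BigInteger t115 = new BigInteger("70892159775195513221536376548285044050"); // 172.21.3.115
        BigInteger t226 = new BigInteger("77981375752715064543690014203113548455"); // 10.177.192.226
        BigInteger t32  = new BigInteger("85070591730234615865843651857942052860"); // 172.21.1.32

        System.out.printf("healthy slot: %.2f%%%n", owns(t115, t226)); // ~4.17
        // With 10.177.192.226 missing, 172.21.1.32 covers two slots:
        System.out.printf("doubled slot: %.2f%%%n", owns(t115, t32));  // ~8.33
    }
}
```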
However, we did notice an exception on startup in the logs on cass11:
{code}
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749     at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
{code}
driftx said that it should be harmless, but it's the only thing I see that is different about that node.
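For context on that stack trace: DataInputStream.readInt() throws EOFException whenever the stream ends before four bytes arrive, so a peer that connects and drops the socket before sending a complete message header triggers exactly this path in IncomingTcpConnection. A minimal demonstration with the plain JDK (no Cassandra code involved):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedRead {
    public static void main(String[] args) throws IOException {
        // Simulate a peer that closed the connection after sending
        // only two bytes of a four-byte message-size header.
        byte[] truncated = {0x00, 0x01};
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated))) {
            in.readInt(); // needs 4 bytes, only 2 available
            System.out.println("read a full int (unexpected)");
        } catch (EOFException e) {
            // This is the exception wrapped in java.io.IOError in the log above.
            System.out.println("EOFException: stream ended mid-header");
        }
    }
}
```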
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997966#comment-12997966 ]
Gary Dusbabek commented on CASSANDRA-2201:
------------------------------------------
With Brandon's comment in mind, a debug log taken from the node after issuing `nodetool ring` would be helpful. It lists the tokens the node knows about (in StorageService). I'm pretty sure this is happening because StorageService.tokenMetadata_ has incomplete information in it.
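One way to see what tokenMetadata_-style state a node is missing is to diff the token-to-endpoint map it reports against a healthy node's map. A hypothetical sketch with plain Java maps (RingDiff and missingEntries are illustrative names, not Cassandra's TokenMetadata API):

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class RingDiff {
    // Given the token->endpoint map from a healthy node and the map a
    // suspect node reports, return the entries the suspect is missing.
    static Map<BigInteger, String> missingEntries(Map<BigInteger, String> healthy,
                                                  Map<BigInteger, String> suspect) {
        Map<BigInteger, String> missing = new TreeMap<>(healthy);
        missing.keySet().removeAll(suspect.keySet());
        return missing;
    }

    public static void main(String[] args) {
        Map<BigInteger, String> healthy = new TreeMap<>();
        healthy.put(new BigInteger("70892159775195513221536376548285044050"), "172.21.3.115");
        healthy.put(new BigInteger("77981375752715064543690014203113548455"), "10.177.192.226");
        healthy.put(new BigInteger("85070591730234615865843651857942052860"), "172.21.1.32");

        Map<BigInteger, String> suspect = new TreeMap<>(healthy);
        suspect.remove(new BigInteger("77981375752715064543690014203113548455"));

        // Prints the one entry cass11 was missing: 10.177.192.226
        System.out.println(missingEntries(healthy, suspect));
    }
}
```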
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
[jira] Assigned: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams reassigned CASSANDRA-2201:
-------------------------------------------
Assignee: Brandon Williams
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Attachment: CASSANDRA-2201-startup-trace.txt
Log from a node with a bad ring, captured with TRACE logging enabled on log4j.logger.org.apache.cassandra.gms.
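For anyone reproducing this, that logger can be raised to TRACE in log4j.properties; a sketch, assuming the stock 0.6-era conf/log4j.properties layout:

```properties
# Raise gossip logging to TRACE without changing the root level.
log4j.logger.org.apache.cassandra.gms=TRACE
```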
After this startup, the node saw the ring as follows:
{code}
$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.19 GB 4.17% 0 |<--|
172.21.2.169 Up 226.34 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 226.05 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 227.23 GB 4.17% 28356863910078205288614550619314017620 | ^
172.21.2.173 Up 227.64 GB 8.33% 42535295865117307932921825928971026430 v |
172.21.2.223 Up 229.14 GB 8.33% 56713727820156410577229101238628035240 | ^
172.21.3.115 Up 231.46 GB 8.33% 70892159775195513221536376548285044050 v |
10.177.192.226 Up 194.35 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.77 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227 Up 196.68 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228 Up 192.05 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.9 GB 4.17% 113427455640312821154458202477256070480 | ^
172.21.0.26 Up 226.33 GB 8.33% 127605887595351923798765477786913079290 v |
10.177.192.230 Up 187.47 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.63 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231 Up 188.33 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 216.22 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232 Up 217.72 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
[jira] [Resolved] (CASSANDRA-2201) Gossip synchronization issues
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-2201.
---------------------------------------
Resolution: Cannot Reproduce
Assignee: (was: Brandon Williams)
[jira] Issue Comment Edited: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996863#comment-12996863 ]
Paul Querna edited comment on CASSANDRA-2201 at 2/19/11 9:33 PM:
-----------------------------------------------------------------
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
-Restarting them each seems to have fixed this issue.-
Restarting sometimes fixed the issue.
was (Author: pquerna):
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
Restarting them each seems to have fixed this issue.
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Attachment: CASSANDRA-2201-jstack.txt
jstack of one of the other nodes we found in this state.
[jira] Issue Comment Edited: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]
Brandon Williams edited comment on CASSANDRA-2201 at 2/19/11 11:33 PM:
-----------------------------------------------------------------------
Note: this log appears to be a capture of stdout rather than the log file, since there are no class names or line numbers; the file would make things slightly easier.
The one thing this log does indicate is that gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24-node cluster before CASSANDRA-2092, since a node sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct, since a node wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication strategy, where the range-to-endpoint map is constructed, but I don't see anything obvious there. Are you using RAS/PFEPS?
was (Author: brandon.williams):
Note: this log appears to be a capture from stdout and not the file, since there are no class names or line numbers, which would make things slightly easier.
The one thing this log indicates is, gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24 node cluster before CASSANDRA-2092 since it sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct since it wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication where the range to endpoint map is constructed, but I don't see anything obvious there.
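The digest and UP-message counts cited in the comment above can be sanity-checked with plain arithmetic. This is only a restatement of the comment's reasoning, under the stated assumption that before CASSANDRA-2092 a node's gossip syn included its own digest twice:

```java
// Sanity check of the counts cited above (plain arithmetic, not Cassandra code).
public class GossipCounts {
    // Before CASSANDRA-2092, a gossip syn listed every node once plus the
    // sender's own digest a second time.
    static int expectedDigests(int clusterSize) {
        return clusterSize + 1;
    }

    // A node never logs "is now UP" for itself.
    static int expectedUpMessages(int clusterSize) {
        return clusterSize - 1;
    }

    public static void main(String[] args) {
        System.out.println(expectedDigests(24));    // 25, matching the log
        System.out.println(expectedUpMessages(24)); // 23, matching the log
    }
}
```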
> Gossip synchronization issues
> -----------------------------
>
> Key: CASSANDRA-2201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.12
> Environment: r1071793 (0.6.12)
> Ubuntu 9.10
> 24 node cluster.
> JNA enabled.
> java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
> Reporter: Paul Querna
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
>
>
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996863#comment-12996863 ]
Paul Querna commented on CASSANDRA-2201:
----------------------------------------
After more investigation, we found that 3 of 24 nodes had inconsistent rings.
Restarting them each seems to have fixed this issue.
[jira] Updated: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Querna updated CASSANDRA-2201:
-----------------------------------
Description:
After upgrading to 0.6.12ish, we noticed that sometimes whole rows were being reported as missing from queries.
It seemed random, and at first we thought there might be a wider problem in 0.6.12 -- but we found that one node of 24 had incorrect gossip state.
Correct nodetool ring output:
{code}
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
10.177.192.88 Up 219.28 GB 4.17% 7089215977519551322153637654828504405 | ^
172.21.2.169 Up 225.93 GB 4.17% 14178431955039102644307275309657008810 v |
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 | ^
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 v |
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 | ^
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 v |
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 | ^
172.21.2.223 Up 229.38 GB 4.17% 56713727820156410577229101238628035240 v |
10.177.192.225Up 193.1 GB 4.17% 63802943797675961899382738893456539645 | ^
172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050 v |
10.177.192.226Up 194.33 GB 4.17% 77981375752715064543690014203113548455 | ^
172.21.1.32 Up 230.38 GB 4.17% 85070591730234615865843651857942052860 v |
10.177.192.227Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
172.21.2.224 Up 205.9 GB 4.17% 99249023685273718510150927167599061670 v |
10.177.192.228Up 191.82 GB 4.17% 106338239662793269832304564822427566075 | ^
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 v |
10.177.192.229Up 193.2 GB 4.17% 120516671617832372476611840132084574885 | ^
172.21.0.26 Up 226.12 GB 4.17% 127605887595351923798765477786913079290 v |
10.177.192.230Up 187.28 GB 4.17% 134695103572871475120919115441741583695 | ^
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 v |
10.177.192.231Up 188.05 GB 4.17% 148873535527910577765226390751398592505 | ^
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 v |
10.177.192.232Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
On the node that had a different nodetool ring output:
{code}
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address Status Load Owns Range Ring
163051967482949680409533666061055601315
172.21.2.222 Up 224.03 GB 4.17% 0 |<--|
172.21.2.169 Up 225.93 GB 8.33% 14178431955039102644307275309657008810 | ^
10.177.192.89 Up 225.91 GB 4.17% 21267647932558653966460912964485513215 v |
172.21.3.116 Up 226.88 GB 4.17% 28356863910078205288614550619314017620 | ^
10.177.192.90 Up 219.2 GB 4.17% 35446079887597756610768188274142522025 v |
172.21.2.173 Up 227.44 GB 4.17% 42535295865117307932921825928971026430 | ^
10.177.192.91 Up 182.44 GB 4.17% 49624511842636859255075463583799530835 v |
172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050 | ^
172.21.1.32 Up 230.38 GB 8.33% 85070591730234615865843651857942052860 v |
10.177.192.227Up 196.34 GB 4.17% 92159807707754167187997289512770557265 | ^
10.177.192.228Up 191.82 GB 8.33% 106338239662793269832304564822427566075 v |
172.21.3.117 Up 230.5 GB 4.17% 113427455640312821154458202477256070480 | ^
10.177.192.229Up 193.2 GB 4.17% 120516671617832372476611840132084574885 v |
172.21.0.26 Up 226 GB 4.17% 127605887595351923798765477786913079290 | ^
10.177.192.230Up 187.28 GB 4.17% 134695103572871475120919115441741583695 v |
172.21.2.225 Up 230.34 GB 4.17% 141784319550391026443072753096570088100 | ^
10.177.192.231Up 188.05 GB 4.17% 148873535527910577765226390751398592505 v |
172.21.3.119 Up 215.91 GB 4.17% 155962751505430129087380028406227096910 | ^
10.177.192.232Up 217.41 GB 4.17% 163051967482949680409533666061055601315 |-->|
{code}
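The Owns column in the output above already quantifies the problem: with 24 evenly spaced tokens, each node owns 1/24 of the ring (4.17%), and when a node is absent from the local view its range is attributed to its successor, so cass11 shows 8.33% entries (one missing predecessor) and a 12.50% entry (two missing predecessors). A quick, hedged sketch of that arithmetic (not nodetool's code):

```java
import java.util.Locale;

// The Owns percentages are just slot counts on a balanced 24-token ring;
// a node missing from the local view donates its slot to the next node
// in token order.
public class OwnsCheck {
    static String owns(int slots, int clusterSize) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * slots / clusterSize);
    }

    public static void main(String[] args) {
        System.out.println(owns(1, 24)); // 4.17%  - all predecessors visible
        System.out.println(owns(2, 24)); // 8.33%  - one predecessor missing
        System.out.println(owns(3, 24)); // 12.50% - two predecessors missing
    }
}
```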
As you can see, 10.177.192.226 was missing from the ring.
On cass11, everything else looked fine, including nothing pending or active in tpstats.
However, we did notice an exception on startup in the cass11 logs:
{code}
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747 at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749 at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750 at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
{code}
driftx said that it should be harmless, but it's the only thing I see that's different about that node.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2201) Gossip synchronization issues
Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]
Brandon Williams commented on CASSANDRA-2201:
---------------------------------------------
Note: this log appears to be captured from stdout rather than the log file, since there are no class names or line numbers; those would make things slightly easier.
The one thing this log does indicate is that gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24-node cluster before CASSANDRA-2092, since a node sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct since a node wouldn't have one for itself.
This leads me to believe the problem must be in either the storage service or the replication code where the range-to-endpoint map is constructed, but I don't see anything obvious there.
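The digest and 'is now UP' counts above can be sketched as simple arithmetic. A hedged illustration (the class `GossipCount` is mine, not Cassandra's `Gossiper`): pre-CASSANDRA-2092, a node's gossip SYN carries one digest per known endpoint plus its own digest a second time, and 'is now UP' is logged for every peer except the local node.

```java
public class GossipCount {
    // Pre-CASSANDRA-2092: one digest per endpoint, plus the local
    // node's own digest sent a second time.
    static int expectedDigests(int clusterSize) {
        return clusterSize + 1;
    }

    // A node never logs 'is now UP' for itself.
    static int expectedUpMessages(int clusterSize) {
        return clusterSize - 1;
    }

    public static void main(String[] args) {
        System.out.println(expectedDigests(24));    // 25, matching the cass11 log
        System.out.println(expectedUpMessages(24)); // 23, matching the cass11 log
    }
}
```

Both numbers match the captured log for a 24-node cluster, which is why gossip itself looks healthy despite the wrong ring view.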
> Gossip synchronization issues
> -----------------------------
>
> Key: CASSANDRA-2201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.12
> Environment: r1071793 (0.6.12)
> Ubuntu 9.10
> 24 node cluster.
> JNA enabled.
> java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
> Reporter: Paul Querna
> Assignee: Brandon Williams
> Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
>