Posted to commits@cassandra.apache.org by "Joel Knighton (JIRA)" <ji...@apache.org> on 2017/04/06 17:02:42 UTC

[jira] [Comment Edited] (CASSANDRA-13407) test failure at RemoveTest.testBadHostId

    [ https://issues.apache.org/jira/browse/CASSANDRA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959301#comment-15959301 ] 

Joel Knighton edited comment on CASSANDRA-13407 at 4/6/17 5:02 PM:
-------------------------------------------------------------------

For posterity, this is the race possible when the Gossiper is started, as far as I can tell.

In setup, we initialize a fake ring using Util.createInitialRing. This will initialize the nodes in an unsafe manner and then inject the token states. If a status check runs before the token state is set, the previously decommissioned node will look like a fat client, since it won't have tokens and will not have a DEAD_STATE. Since we aren't gossiping, we won't have heard from it for longer than fatClientTimeout, so we'll remove it. If this races with the ss.onChange in createInitialRing, we can remove the endpoint state while processing it, which will cause an NPE as above.
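
To make the interleaving concrete, here is a minimal, single-threaded sketch that replays the problematic ordering deterministically. It is not the actual Gossiper or StorageService code: the class FatClientRaceSketch, the EndpointState stand-in, the statusCheck and getHostId helpers, the address, and the timeout values are all hypothetical names and numbers chosen for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FatClientRaceSketch {
    // Hypothetical stand-in for a gossip endpoint state; not Cassandra's class.
    static final class EndpointState {
        final String hostId;
        final long lastHeardMillis;
        volatile boolean hasTokens = false;   // token state not injected yet
        final boolean deadState = false;      // no DEAD_STATE recorded

        EndpointState(String hostId, long lastHeardMillis) {
            this.hostId = hostId;
            this.lastHeardMillis = lastHeardMillis;
        }
    }

    static final Map<String, EndpointState> endpointStates = new ConcurrentHashMap<>();

    // A node with no tokens and no dead state looks like a fat client.
    static boolean looksLikeFatClient(EndpointState s) {
        return !s.hasTokens && !s.deadState;
    }

    // Simplified status check: evict fat clients not heard from within the timeout.
    static void statusCheck(long nowMillis, long fatClientTimeoutMillis) {
        endpointStates.entrySet().removeIf(e ->
            looksLikeFatClient(e.getValue())
                && nowMillis - e.getValue().lastHeardMillis > fatClientTimeoutMillis);
    }

    // Dereferences the state without a null check, like the failing call path.
    static String getHostId(String endpoint) {
        return endpointStates.get(endpoint).hostId;
    }

    // Bad ordering: the status check runs before the token state is set.
    static boolean raceReproduced() {
        endpointStates.clear();
        endpointStates.put("127.0.0.6", new EndpointState("host-6", 0L));
        statusCheck(60_000L, 30_000L);        // evicts the tokenless node
        try {
            getHostId("127.0.0.6");           // state is gone by now
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Good ordering: the token state is set before any status check runs.
    static boolean safeWhenTokensSetFirst() {
        endpointStates.clear();
        EndpointState state = new EndpointState("host-6", 0L);
        endpointStates.put("127.0.0.6", state);
        state.hasTokens = true;
        statusCheck(60_000L, 30_000L);        // node no longer looks like a fat client
        return "host-6".equals(getHostId("127.0.0.6"));
    }

    public static void main(String[] args) {
        System.out.println("NPE with early status check: " + raceReproduced());
        System.out.println("Safe when tokens set first: " + safeWhenTokensSetFirst());
    }
}
```

In the real test the two steps run on different threads, so the bad ordering only happens occasionally; the sketch just fixes the order to show why the lookup fails once the state has been evicted.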

We also need to remove SchemaLoader.loadSchema() as you did in the patch - this is because it starts the Gossiper as well. This is fine; we don't appear to need it.

The patch looks good - the race exists in theory on 2.1/2.2, but it appears to only manifest on 3.0+. I don't think it is worth committing to 2.1 for that reason - let's do 2.2+ forward and run the test at least once on each branch before committing.




was (Author: jkni):
For posterity, this is the race possible when the Gossiper is started, as far as I can tell.

In setup, we initialize a fake ring using Util.createInitialRing. This will initialize the nodes in an unsafe manner and then inject the token states. If a status check runs before the token state is set, the previously decommissioned node will look like a fat client, since it won't have tokens and will not have a DEAD_STATE. Since we aren't gossiping, we won't have heard from it for longer than fatClientTimeout, so we'll remove it. If this races with the ss.onChange in createInitialRing, we can remove the endpoint state while processing it, which will cause an NPE as above. This race can be seen at 16:15:51,205 in the log linked from the test failure.

We also need to remove SchemaLoader.loadSchema() as you did in the patch - this is because it starts the Gossiper as well. This is fine; we don't appear to need it.

The patch looks good - the race exists in theory on 2.1/2.2, but it appears to only manifest on 3.0+. I don't think it is worth committing to 2.1 for that reason - let's do 2.2+ forward and run the test at least once on each branch before committing.



> test failure at RemoveTest.testBadHostId
> ----------------------------------------
>
>                 Key: CASSANDRA-13407
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13407
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>
> Example trace:
> {code}
> java.lang.NullPointerException
> 	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:881)
> 	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:876)
> 	at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2201)
> 	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1855)
> 	at org.apache.cassandra.Util.createInitialRing(Util.java:216)
> 	at org.apache.cassandra.service.RemoveTest.setup(RemoveTest.java:89)
> {code} 
> [failure example|https://cassci.datastax.com/job/trunk_testall/1491/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/]
> [history|https://cassci.datastax.com/job/trunk_testall/lastCompletedBuild/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/history/]


