Posted to commits@cassandra.apache.org by "Gianluca Righetto (Jira)" <ji...@apache.org> on 2020/04/10 16:12:00 UTC

[jira] [Comment Edited] (CASSANDRA-15551) Fix flaky tests org.apache.cassandra.service.MoveTest testStateJumpToNormal and testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes

    [ https://issues.apache.org/jira/browse/CASSANDRA-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080606#comment-17080606 ] 

Gianluca Righetto edited comment on CASSANDRA-15551 at 4/10/20, 4:11 PM:
-------------------------------------------------------------------------

The issue here is that once this line is executed in MoveTest's @Before method, {{StorageService.instance.getTokenMetadata().clearUnsafe()}}, the {{GossipStage}} thread kicks in and starts evicting the stale endpoints from membership. That eviction may still be running in parallel while the next test method is already executing.

To reproduce this in an IDE, you can set breakpoints at:

[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/test/unit/org/apache/cassandra/Util.java#L222]

and

[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/src/java/org/apache/cassandra/gms/Gossiper.java#L524]

If the main thread starts executing the second iteration of the loop in {{createInitialRing}} while the GossipStage thread is removing the endpoints in {{evictFromMembership}}, it will later throw an NPE inside {{StorageService.handleStateNormal}}, as in the stack traces in the issue description.

The fix I submitted makes the main thread wait for all endpoints to be evicted in between tests, so that the next test starts in a clean state.
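
As a rough illustration of that idea (this is not the code from the pull request), the sketch below simulates the pattern with plain JDK classes. Everything in it is hypothetical: {{EvictionBarrierSketch}}, {{endpoints}}, {{gossipStage}}, {{scheduleEviction}} and {{awaitCondition}} are stand-ins invented for the example and do not correspond to Cassandra APIs. A single background thread removes entries asynchronously, and the main thread polls with a timeout until the shared state is empty before continuing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class EvictionBarrierSketch {
    // Hypothetical stand-in for gossip membership state (endpoint -> host id).
    static final Map<String, String> endpoints = new ConcurrentHashMap<>();

    // Hypothetical stand-in for the single-threaded GossipStage.
    // A daemon thread is used so the JVM can exit without an explicit shutdown.
    static final ExecutorService gossipStage = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "GossipStage-sketch");
        t.setDaemon(true);
        return t;
    });

    // Simulates the asynchronous eviction that starts after test setup clears state.
    static void scheduleEviction() {
        List<String> stale = new ArrayList<>(endpoints.keySet());
        for (String endpoint : stale) {
            gossipStage.execute(() -> {
                try {
                    Thread.sleep(20); // pretend each eviction takes a little while
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                endpoints.remove(endpoint);
            });
        }
    }

    // Polls until the condition holds or the timeout elapses.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        endpoints.put("127.0.0.1", "host-id-1");
        endpoints.put("127.0.0.2", "host-id-2");
        endpoints.put("127.0.0.3", "host-id-3");

        scheduleEviction();

        // Barrier between "tests": do not proceed until eviction has finished.
        boolean clean = awaitCondition(endpoints::isEmpty, 5_000);
        System.out.println("clean state reached: " + clean);
    }
}
```

In a JUnit test the equivalent wait would sit in the setup or teardown method, so that each test body only runs once the background work from the previous one has drained. Polling with a deadline is used here for simplicity; it fails the wait instead of hanging if eviction never completes.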

Pull request: [https://github.com/apache/cassandra/pull/533]
 Java 11 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/68]
 Java 8 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/65]


was (Author: gianluca):
The issue here is that once this line is executed in the @Before setup method, {{StorageService.instance.getTokenMetadata().clearUnsafe()}}, the {{GossipStage}} thread kicks in and starts evicting the stale endpoints from membership, which may happen in parallel while another test method is already running.

To reproduce this in an IDE, you can set breakpoints at:

https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/test/unit/org/apache/cassandra/Util.java#L222

and

https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/src/java/org/apache/cassandra/gms/Gossiper.java#L524

If the main thread starts executing the second iteration of the loop in {{createInitialRing}} while the GossipStage thread is removing the endpoints in {{evictFromMembership}}, it will throw an NPE down the road.

The fix I submitted basically makes the main thread wait for all endpoints to be evicted in between tests, such that the next test starts in a clean state.

Pull request: https://github.com/apache/cassandra/pull/533
Java 11 Unit Tests results: https://circleci.com/gh/grighetto/cassandra/68
Java 8 Unit Tests results: https://circleci.com/gh/grighetto/cassandra/65

> Fix flaky tests org.apache.cassandra.service.MoveTest testStateJumpToNormal and testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15551
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15551
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Gianluca Righetto
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> testStateJumpToNormal failure was on java 11
> {code}
> java.lang.NullPointerException
> 	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1028)
> 	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1023)
> 	at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2513)
> 	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
> 	at org.apache.cassandra.Util.createInitialRing(Util.java:225)
> 	at org.apache.cassandra.service.MoveTest.testStateJumpToNormal(MoveTest.java:935)
> 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes failure was on java 8
> {code}
> java.lang.NullPointerException
> 	at org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:2174)
> 	at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2511)
> 	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
> 	at org.apache.cassandra.Util.createInitialRing(Util.java:225)
> 	at org.apache.cassandra.service.MoveTest.testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes(MoveTest.java:199)
> {code}


