You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Dan Smith (Jira)" <ji...@apache.org> on 2020/10/27 21:41:00 UTC

[jira] [Commented] (GEODE-8659) MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails with ConditionTimeoutException

    [ https://issues.apache.org/jira/browse/GEODE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221777#comment-17221777 ] 

Dan Smith commented on GEODE-8659:
----------------------------------

This appears to be a race condition during the join process. In the logs, there are two locators, one with port 41002 (locator A) and one with port 41003 (locator B). Locator A is configured with a locator-wait-time and the address of locator B. Locator B has no wait time and no knowledge of other locators.

What I see in the test is this:
1. Locator A starts up. It is spinning and checking for Locator B to start
{noformat}
info 2020/10/26 17:42:42.970 GMT <pool-3-thread-1> tid=93] Unable to contact locator HostAndPort[5199b02bbb6f:32857]: java.io.EOFException: Locator at HostAndPort[5199b02bbb6f:32857] did not respond. This is normal if the locator was shutdown. If it wasn't check its log for exceptions.


[info 2020/10/26 17:42:43.975 GMT <pool-3-thread-1> tid=93] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[]; registrants=[172.17.0.34(1:locator)<ec>:41002]; possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; hasContactedAJoinedLocator=false; view=null; responses=[]

{noformat}
2. Locator B starts. It creates a distributed system and becomes the coordinator
{noformat}
[info 2020/10/26 17:42:44.818 GMT <Test worker> tid=27] Starting membership services

[info 2020/10/26 17:42:45.003 GMT <Test worker> tid=27] This member is becoming the membership coordinator with address 172.17.0.34(1:locator)<ec>:41003

[info 2020/10/26 17:42:45.010 GMT <Test worker> tid=27] Finished joining (took 60ms).

{noformat}

3. Locator A manages to contact Locator B after it has created the system but *before* it has installed a view. This means it gets an interesting response. It is not a failure, but it also does not have any joined members.
{noformat}
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] received FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec><v0>:41003, fromView=false, viewId=null, registrants=[], senderId=172.17.0.34(1:locator)<ec><v0>:41003, network partition detection enabled=true, locators preferred as coordinators=true, view=null) from locator HostAndPort[5199b02bbb6f:32857]

[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join

[info 2020/10/26 17:42:45.022 GMT <pool-3-thread-1> tid=93] received FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec>:41002, fromView=false, viewId=null, registrants=[172.17.0.34(1:locator)<ec>:41002], senderId=172.17.0.34(1:locator)<ec>:41002, network partition detection enabled=true, locators preferred as coordinators=true, view=null) from locator HostAndPort[5199b02bbb6f:46597]

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] findCoordinator chose 172.17.0.34(1:locator)<ec>:41002 out of these possible coordinators: [172.17.0.34(1:locator)<ec><v0>:41003, 172.17.0.34(1:locator)<ec>:41002]

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[]; registrants=[172.17.0.34(1:locator)<ec>:41002]; possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; hasContactedAJoinedLocator=true; view=null; responses=[]

{noformat}

4. It appears because Locator A sees that it has contacted all of the locators, and there is not actually a view in that response, it decides it should become the coordinator as well, creating a split brain.
{noformat}
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] found possible coordinator 172.17.0.34(1:locator)<ec>:41002

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] This member is becoming the membership coordinator with address 172.17.0.34(1:locator)<ec>:41002

{noformat}

The problem seems to be the above response. Locator B is sending a response to Locator A, but that is not informing Locator A that it should not become the coordinator. It believe it's passing through this check because *joinedMembersContacted* is still zero:
{code}
            if (state.joinedMembersContacted <= 0 && ((now >= locatorGiveUpTime &&
                tries >= minimumRetriesBeforeBecomingCoordinator) ||
                state.locatorsContacted >= locators.size())) {
              synchronized (viewInstallationLock) {
                becomeCoordinator();
              }

{code}

There is *another* flag on SearchState that we are not checking in the above block, which is *hasContactedAJoinedLocator*. So maybe we could check that as well? However, it's not clear to me that it will definitely always be true in the above case. Ideally, a locator will either (A) not respond, or (B) respond and affirmatively indicate that it's becoming the coordinator in this case.

> MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails with ConditionTimeoutException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8659
>                 URL: https://issues.apache.org/jira/browse/GEODE-8659
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, tests
>            Reporter: Kirk Lund
>            Priority: Major
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-results/integrationTest/1603735181/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-artifacts/1603735181/integrationtestfiles-geode-pr-5667.tgz
> {noformat}
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest > secondMembershipPausesForLocatorWaitTime FAILED
>     org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest that uses org.apache.geode.distributed.internal.membership.api.Membership 
>     Expected size:<2> but was:<1> in:
>     <[172.17.0.34(1:locator)<ec><v0>:41003]> within 5 minutes.
>         at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
>         at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
>         at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
>         at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
>         at org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime(MembershipIntegrationTest.java:270)
>         Caused by:
>         java.lang.AssertionError: 
>         Expected size:<2> but was:<1> in:
>         <[172.17.0.34(1:locator)<ec><v0>:41003]>
>             at org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.lambda$secondMembershipPausesForLocatorWaitTime$8(MembershipIntegrationTest.java:271)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)