You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Dan Smith (Jira)" <ji...@apache.org> on 2020/10/27 21:41:00 UTC
[jira] [Commented] (GEODE-8659)
MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails
with ConditionTimeoutException
[ https://issues.apache.org/jira/browse/GEODE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221777#comment-17221777 ]
Dan Smith commented on GEODE-8659:
----------------------------------
This appears to be a race condition during the join process. In the logs, there are two locators, one with port 41002 (locator A) and one with port 41003 (locator B). Locator A is configured with a locator-wait-time and the address of locator B. Locator B has no wait time and no knowledge of other locators.
What I see in the test is this:
1. Locator A starts up. It is spinning and checking for Locator B to start
{noformat}
info 2020/10/26 17:42:42.970 GMT <pool-3-thread-1> tid=93] Unable to contact locator HostAndPort[5199b02bbb6f:32857]: java.io.EOFException: Locator at HostAndPort[5199b02bbb6f:32857] did not respond. This is normal if the locator was shutdown. If it wasn't check its log for exceptions.
[info 2020/10/26 17:42:43.975 GMT <pool-3-thread-1> tid=93] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[]; registrants=[172.17.0.34(1:locator)<ec>:41002]; possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; hasContactedAJoinedLocator=false; view=null; responses=[]
{noformat}
2. Locator B starts. It creates a distributed system and becomes the coordinator
{noformat}
[info 2020/10/26 17:42:44.818 GMT <Test worker> tid=27] Starting membership services
[info 2020/10/26 17:42:45.003 GMT <Test worker> tid=27] This member is becoming the membership coordinator with address 172.17.0.34(1:locator)<ec>:41003
[info 2020/10/26 17:42:45.010 GMT <Test worker> tid=27] Finished joining (took 60ms).
{noformat}
3. Locator A manages to contact Locator B after it has created the system but *before* it has installed a view. This means it gets an interesting response. It is not a failure, but it also does not have any joined members.
{noformat}
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] received FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec><v0>:41003, fromView=false, viewId=null, registrants=[], senderId=172.17.0.34(1:locator)<ec><v0>:41003, network partition detection enabled=true, locators preferred as coordinators=true, view=null) from locator HostAndPort[5199b02bbb6f:32857]
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join
[info 2020/10/26 17:42:45.022 GMT <pool-3-thread-1> tid=93] received FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec>:41002, fromView=false, viewId=null, registrants=[172.17.0.34(1:locator)<ec>:41002], senderId=172.17.0.34(1:locator)<ec>:41002, network partition detection enabled=true, locators preferred as coordinators=true, view=null) from locator HostAndPort[5199b02bbb6f:46597]
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] findCoordinator chose 172.17.0.34(1:locator)<ec>:41002 out of these possible coordinators: [172.17.0.34(1:locator)<ec><v0>:41003, 172.17.0.34(1:locator)<ec>:41002]
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[]; registrants=[172.17.0.34(1:locator)<ec>:41002]; possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; hasContactedAJoinedLocator=true; view=null; responses=[]
{noformat}
4. It appears because Locator A sees that it has contacted all of the locators, and there is not actually a view in that response, it decides it should become the coordinator as well, creating a split brain.
{noformat}
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] found possible coordinator 172.17.0.34(1:locator)<ec>:41002
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] This member is becoming the membership coordinator with address 172.17.0.34(1:locator)<ec>:41002
{noformat}
The problem seems to be the above response. Locator B is sending a response to Locator A, but that is not informing Locator A that it should not become the coordinator. It believe it's passing through this check because *joinedMembersContacted* is still zero:
{code}
if (state.joinedMembersContacted <= 0 && ((now >= locatorGiveUpTime &&
tries >= minimumRetriesBeforeBecomingCoordinator) ||
state.locatorsContacted >= locators.size())) {
synchronized (viewInstallationLock) {
becomeCoordinator();
}
{code}
There is *another* flag on SearchState that we are not checking in the above block, which is *hasContactedAJoinedLocator*. So maybe we could check that as well? However, it's not clear to me that it will definitely always be true in the above case. Ideally, a locator will either (A) not respond, or (B) respond and affirmatively indicate that it's becoming the coordinator in this case.
> MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails with ConditionTimeoutException
> -------------------------------------------------------------------------------------------------------
>
> Key: GEODE-8659
> URL: https://issues.apache.org/jira/browse/GEODE-8659
> Project: Geode
> Issue Type: Bug
> Components: membership, tests
> Reporter: Kirk Lund
> Priority: Major
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-results/integrationTest/1603735181/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-artifacts/1603735181/integrationtestfiles-geode-pr-5667.tgz
> {noformat}
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest > secondMembershipPausesForLocatorWaitTime FAILED
> org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest that uses org.apache.geode.distributed.internal.membership.api.Membership
> Expected size:<2> but was:<1> in:
> <[172.17.0.34(1:locator)<ec><v0>:41003]> within 5 minutes.
> at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
> at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
> at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
> at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
> at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
> at org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime(MembershipIntegrationTest.java:270)
> Caused by:
> java.lang.AssertionError:
> Expected size:<2> but was:<1> in:
> <[172.17.0.34(1:locator)<ec><v0>:41003]>
> at org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.lambda$secondMembershipPausesForLocatorWaitTime$8(MembershipIntegrationTest.java:271)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)