Posted to issues@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2018/05/22 18:02:00 UTC

[jira] [Comment Edited] (GEODE-5238) CI failure: SerialWANPropagationsFeatureDUnitTest.testSerialPropagationWithFilter fails with Possible loss of quorum suspect string

    [ https://issues.apache.org/jira/browse/GEODE-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484365#comment-16484365 ] 

Bruce Schuchardt edited comment on GEODE-5238 at 5/22/18 6:01 PM:
------------------------------------------------------------------

The log file shows that vm3 logged loss of quorum, but this did not initiate a forced disconnect: vm3 performed a normal shutdown. vm2's "leave" request was received by vm3, but not before vm3 had decided that vm2 had crashed.
A customer encountering this issue would not experience data loss or any loss of service.

If we want to avoid this log message we would need to change ShutdownMessage to require a response. It used to do this, but in 2007 we changed it to not expect a reply in order to marginally speed up shutdown. If we reverted that change we'd ensure that vm3 processed vm2's shutdown message, avoiding the suspect processing that led to the "quorum lost" declaration.
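
To make the difference concrete, here is a minimal sketch of the two shutdown styles. The Messenger abstraction and all names here are hypothetical illustrations, not Geode's actual ShutdownMessage internals.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ShutdownStyles {

  // Minimal messenger abstraction so the sketch is self-contained.
  interface Messenger {
    void send(String message);
    void onReply(String type, Runnable handler);
    void close();
  }

  // Current behavior: send the shutdown notice and close immediately.
  // The receiver can observe the closed connection before it processes
  // the message and raise a "member unexpectedly shut down" suspect.
  static void fireAndForgetShutdown(Messenger m) {
    m.send("SHUTDOWN");
    m.close(); // may race with the remote side processing the message
  }

  // Pre-2007 behavior: wait for an acknowledgement so the receiver has
  // recorded the graceful leave before the connection drops.
  static void shutdownWithReply(Messenger m) throws InterruptedException {
    CountDownLatch ack = new CountDownLatch(1);
    m.onReply("SHUTDOWN_ACK", ack::countDown);
    m.send("SHUTDOWN");
    if (!ack.await(5, TimeUnit.SECONDS)) {
      // the remote side may already be gone; close anyway
    }
    m.close(); // no suspect processing: the leave was acknowledged
  }
}
{code}

The cost of the second style is one extra round trip per departing member, which is the marginal shutdown speedup the 2007 change bought.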



was (Author: bschuchardt):
The log file shows that vm3 logged loss of quorum, but this did not initiate a forced disconnect: vm3 performed a normal shutdown. vm2's "leave" request was received by vm3, but not before vm3 had decided that vm2 had crashed.
A customer encountering this issue would not experience data loss or any loss of service.

If we want to avoid this log message we would need to change ShutdownMessage to require a response. It used to do this, but in 2007 we changed it to not expect a reply in order to marginally speed up shutdown. If we reverted that change we'd ensure that vm3 processed vm2's shutdown message, avoiding the suspect processing that led to the "quorum lost" declaration.

This is too risky a change to make in 9.5.1.  We should ignore this suspect string and go ahead with the release.

> CI failure: SerialWANPropagationsFeatureDUnitTest.testSerialPropagationWithFilter fails with Possible loss of quorum suspect string
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5238
>                 URL: https://issues.apache.org/jira/browse/GEODE-5238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Barry Oglesby
>            Assignee: Galen O'Sullivan
>            Priority: Major
>         Attachments: TEST-org.apache.geode.internal.cache.wan.serial.SerialWANPropagationsFeatureDUnitTest.xml
>
>
> The members startup:
> {noformat}
> [vm1] [info 2018/05/19 02:18:54.055 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(154:locator)<ec><v0>:32771. (took 49 ms)
> [vm2] [info 2018/05/19 02:18:54.479 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(159)<v1>:32772. (took 336 ms)
> [vm3] [info 2018/05/19 02:18:55.477 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(168)<v2>:32773. (took 401 ms)
> [vm3] [info 2018/05/19 02:18:55.478 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Initial (distribution manager) view = View[172.17.0.5(154:locator)<ec><v0>:32771|2] members: [172.17.0.5(154:locator)<ec><v0>:32771, 172.17.0.5(159)<v1>:32772{lead}, 172.17.0.5(168)<v2>:32773]
> {noformat}
> So the members are:
> {noformat}
> vm1 - locator, coordinator (weight=3)
> vm2 - lead member (weight=15)
> vm3 - member (weight=10)
> {noformat}
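> The arithmetic behind the later "Quorum has been lost!" warning follows from these weights. vm1's graceful departure removes its weight of 3 first, so the view vm3 evaluates has a total weight of 15 + 10 = 25, and vm2's apparent crash then loses 15 of those 25 (60%). A simplified sketch, assuming quorum is declared lost when more than half the view's weight disappears in one view change (an assumption consistent with the "15 of 25" warning below):
> {code:java}
> // Simplified quorum-weight check. Weights match the list above; the
> // majority-loss threshold is an assumption inferred from the log output.
> public class QuorumCheck {
>     public static void main(String[] args) {
>         int leadWeight = 15;   // vm2, the lead member
>         int memberWeight = 10; // vm3
>
>         // vm1 (weight 3) already left gracefully, so it is out of the view.
>         int viewWeight = leadWeight + memberWeight; // 25
>
>         // vm2's connection dropped before its shutdown message was
>         // processed, so its full weight counts as crashed/lost.
>         int lostWeight = leadWeight; // 15
>
>         boolean quorumLost = lostWeight * 2 > viewWeight; // 60% > 50%
>         System.out.printf("lost %d of %d -> quorum lost: %b%n",
>             lostWeight, viewWeight, quorumLost);
>     }
> }
> {code}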
> The WANTestBase cleanupVM method shuts down the members simultaneously.
> Log messages show vm1 (the locator) shutting down first, with the other two members recognizing its graceful departure:
> {noformat}
> [vm1] [info 2018/05/19 02:19:10.400 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(154:locator)<ec><v0>:32771. 
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(154:locator)<ec><v0>:32771
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Stopping membership services
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(154:locator)<ec><v0>:32771
> [vm2] [info 2018/05/19 02:19:10.420 UTC <Pooled High Priority Message Processor 1> tid=310] Member at 172.17.0.5(154:locator)<ec><v0>:32771 gracefully left the distributed cache: shutdown message received
> [vm3] [info 2018/05/19 02:19:10.622 UTC <Pooled High Priority Message Processor 1> tid=310] Member at 172.17.0.5(154:locator)<ec><v0>:32771 gracefully left the distributed cache: shutdown message received
> {noformat}
> Then vm2 becomes coordinator:
> {noformat}
> [vm2] [info 2018/05/19 02:19:10.402 UTC <Pooled High Priority Message Processor 1> tid=310] This member is becoming the membership coordinator with address 172.17.0.5(159)<v1>:32772
> {noformat}
> Then vm2 shuts down its DistributionManager:
> {noformat}
> [vm2] [info 2018/05/19 02:19:11.239 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(159)<v1>:32772. 
> [vm2] [info 2018/05/19 02:19:11.352 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(159)<v1>:32772
> [vm2] [info 2018/05/19 02:19:11.352 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Stopping membership services
> [vm2] [info 2018/05/19 02:19:11.377 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] DistributionManager stopped in 138ms.
> [vm2] [info 2018/05/19 02:19:11.377 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Marking DistributionManager 172.17.0.5(159)<v1>:32772 as closed.
> {noformat}
> Then vm3 performs suspect processing on vm2 and logs the quorum-loss warning:
> {noformat}
> [vm3] [info 2018/05/19 02:19:11.414 UTC <P2P message reader for 172.17.0.5(159)<v1>:32772 shared unordered uid=14 port=35502> tid=296] Performing final check for suspect member 172.17.0.5(159)<v1>:32772 reason=member unexpectedly shut down shared, unordered connection
> [vm3] [info 2018/05/19 02:19:11.434 UTC <P2P message reader for 172.17.0.5(159)<v1>:32772 shared unordered uid=14 port=35502> tid=296] This member is becoming the membership coordinator with address 172.17.0.5(168)<v2>:32773
> [vm3] [info 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] View Creator thread is starting
> [vm3] [info 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] 172.17.0.5(159)<v1>:32772 had a weight of 15
> [vm3] [warn 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] total weight lost in this view change is 15 of 25. Quorum has been lost!
> [vm3] [fatal 2018/05/19 02:19:11.438 UTC <Geode Membership View Creator> tid=360] Possible loss of quorum due to the loss of 1 cache processes: [172.17.0.5(159)<v1>:32772]
> {noformat}
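> This suspect sequence is the race described in the comment above: vm2's connection dropped before vm3 had processed any leave notification, so the departure looked like a crash. A hypothetical receiver-side sketch of that ordering (names are illustrative, not Geode's membership code):
> {code:java}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical sketch: if the connection reader observes the socket close
> // before a shutdown message is processed, the departure is treated as a
> // crash rather than a graceful leave.
> public class SuspectOnClose {
>     private final Set<String> gracefullyLeft = ConcurrentHashMap.newKeySet();
>
>     // Runs only if the shutdown message is processed in time.
>     void onShutdownMessage(String memberId) {
>         gracefullyLeft.add(memberId);
>     }
>
>     // Runs when the P2P reader sees the connection drop. Here the close
>     // raced ahead of the unacknowledged shutdown message, so the member
>     // is suspected and a final check begins.
>     void onConnectionClosed(String memberId) {
>         if (!gracefullyLeft.contains(memberId)) {
>             System.out.println("Performing final check for suspect member " + memberId);
>             // ...the final check fails to reach the member, its weight is
>             // counted as lost, and the quorum warning above is logged.
>         }
>     }
> }
> {code}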
> Then vm3 shuts down its DistributionManager:
> {noformat}
> [vm3] [info 2018/05/19 02:19:11.448 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(168)<v2>:32773. 
> [vm3] [info 2018/05/19 02:19:11.588 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(168)<v2>:32773
> [vm3] [info 2018/05/19 02:19:11.588 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Stopping membership services
> [vm3] [info 2018/05/19 02:19:11.617 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] DistributionManager stopped in 168ms.
> [vm3] [info 2018/05/19 02:19:11.617 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Marking DistributionManager 172.17.0.5(168)<v2>:32773 as closed.
> {noformat}


