Posted to issues@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2017/06/09 17:24:18 UTC

[jira] [Reopened] (GEODE-3052) Restarting 2 locators within 1s of each other causes potential locator split brain

     [ https://issues.apache.org/jira/browse/GEODE-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Schuchardt reopened GEODE-3052:
-------------------------------------

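If the recovered view contained servers, the locators can still come up in a split-brain configuration, even with the fix I checked in yesterday. In the logs below, locator1 recovers a view containing server1 and server2, locator2 tries to join through server2 as coordinator, and about nine seconds later each locator declares itself membership coordinator of a view containing only itself.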

{noformat}
locator1/locator1.log: [info 2017/06/09 10:13:29.365 PDT locator1 <main> tid=0x1] Peer locator recovering from /export/trout1/users/bschuchardt/devel/testing/splitbrain/locator1/locator12345view.dat

locator1/locator1.log: [info 2017/06/09 10:13:29.367 PDT locator1 <main> tid=0x1] Peer locator initial membership is View[trout(locator1:9450:locator)<ec><v0>:1024|3] members: [trout(server2:9684)<v2>:1026{lead}, trout(server1:9796)<v3>:1027]




locator2/locator2.log: [info 2017/06/09 10:13:30.093 PDT locator2 <main> tid=0x1] Attempting to join the distributed system through coordinator 10.118.26.122(server2:9684)<v2>:1026 using address 10.118.26.122(locator2:9961:locator)<ec>:1025


locator1/locator1.log: [info 2017/06/09 10:13:38.678 PDT locator1 <main> tid=0x1] This member is becoming the membership coordinator with address 10.118.26.122(locator1:9937:locator)<ec>:1024

locator1/locator1.log: [info 2017/06/09 10:13:38.679 PDT locator1 <main> tid=0x1] received new view: View[10.118.26.122(locator1:9937:locator)<ec><v0>:1024|0] members: [10.118.26.122(locator1:9937:locator)<ec><v0>:1024]
  old view is: null

locator1/locator1.log: [info 2017/06/09 10:13:38.679 PDT locator1 <main> tid=0x1] Peer locator received new membership view: View[10.118.26.122(locator1:9937:locator)<ec><v0>:1024|0] members: [10.118.26.122(locator1:9937:locator)<ec><v0>:1024]

locator1/locator1.log: [info 2017/06/09 10:13:38.692 PDT locator1 <main> tid=0x1] ViewCreator starting on:10.118.26.122(locator1:9937:locator)<ec><v0>:1024

locator1/locator1.log: [info 2017/06/09 10:13:38.692 PDT locator1 <Geode Membership View Creator> tid=0x24] View Creator thread is starting

locator1/locator1.log: [info 2017/06/09 10:13:38.693 PDT locator1 <main> tid=0x1] Finished joining (took 9030ms).

locator1/locator1.log: [info 2017/06/09 10:13:38.694 PDT locator1 <Geode Membership View Creator> tid=0x24] no recipients for new view aside from myself

locator1/locator1.log: [info 2017/06/09 10:13:38.695 PDT locator1 <main> tid=0x1] Starting DistributionManager 10.118.26.122(locator1:9937:locator)<ec><v0>:1024.  (took 9250 ms)

locator1/locator1.log: [info 2017/06/09 10:13:38.697 PDT locator1 <main> tid=0x1] Initial (distribution manager) view =  View[10.118.26.122(locator1:9937:locator)<ec><v0>:1024|0] members: [10.118.26.122(locator1:9937:locator)<ec><v0>:1024]

locator1/locator1.log: [info 2017/06/09 10:13:38.697 PDT locator1 <main> tid=0x1] Admitting member <10.118.26.122(locator1:9937:locator)<ec><v0>:1024>. Now there are 1 non-admin member(s).

locator1/locator1.log: [info 2017/06/09 10:13:38.697 PDT locator1 <main> tid=0x1] 10.118.26.122(locator1:9937:locator)<ec><v0>:1024 is the elder and the only member.

locator1/locator1.log: [info 2017/06/09 10:13:38.700 PDT locator1 <main> tid=0x1] Did not hear back from any other system. I am the first one.

locator1/locator1.log: [info 2017/06/09 10:13:38.726 PDT locator1 <main> tid=0x1] Creating cache for locator.

locator1/locator1.log: [info 2017/06/09 10:13:38.895 PDT locator1 <main> tid=0x1] Requesting cluster configuration

locator1/locator1.log: [info 2017/06/09 10:13:38.982 PDT locator1 <main> tid=0x1] Initializing region _monitoringRegion_10.118.26.122<v0>1024

locator1/locator1.log: [info 2017/06/09 10:13:38.986 PDT locator1 <main> tid=0x1] Initialization of region _monitoringRegion_10.118.26.122<v0>1024 completed

locator2/locator2.log: [info 2017/06/09 10:13:39.103 PDT locator2 <main> tid=0x1] This member is becoming the membership coordinator with address 10.118.26.122(locator2:9961:locator)<ec>:1025

locator2/locator2.log: [info 2017/06/09 10:13:39.103 PDT locator2 <main> tid=0x1] received new view: View[10.118.26.122(locator2:9961:locator)<ec><v0>:1025|0] members: [10.118.26.122(locator2:9961:locator)<ec><v0>:1025]
  old view is: null

locator2/locator2.log: [info 2017/06/09 10:13:39.104 PDT locator2 <main> tid=0x1] Peer locator received new membership view: View[10.118.26.122(locator2:9961:locator)<ec><v0>:1025|0] members: [10.118.26.122(locator2:9961:locator)<ec><v0>:1025]

locator2/locator2.log: [info 2017/06/09 10:13:39.117 PDT locator2 <main> tid=0x1] ViewCreator starting on:10.118.26.122(locator2:9961:locator)<ec><v0>:1025

locator2/locator2.log: [info 2017/06/09 10:13:39.118 PDT locator2 <main> tid=0x1] Finished joining (took 9033ms).

locator2/locator2.log: [info 2017/06/09 10:13:39.119 PDT locator2 <main> tid=0x1] Starting DistributionManager 10.118.26.122(locator2:9961:locator)<ec><v0>:1025.  (took 9247 ms)

locator2/locator2.log: [info 2017/06/09 10:13:39.120 PDT locator2 <Geode Membership View Creator> tid=0x24] View Creator thread is starting

locator2/locator2.log: [info 2017/06/09 10:13:39.121 PDT locator2 <Geode Membership View Creator> tid=0x24] no recipients for new view aside from myself
{noformat}
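
One way to spot this condition in logs like the ones above is to look for more than one member announcing itself as membership coordinator during startup. The following is only a minimal sketch in Python, assuming the log line format shown in this comment; the function names are hypothetical and not part of Geode. Note that a coordinator can also change legitimately over time, so this is a heuristic for a restart window, not a definitive split-brain test.

```python
import re

# Matches the message a member logs when it elects itself coordinator.
# Assumes the "[info <date> <time> <tz> <member> <thread> tid=...]" prefix
# seen in the excerpt above.
COORDINATOR_RE = re.compile(
    r"\[info (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \w+ (?P<member>\S+) "
    r".*This member is becoming the membership coordinator with address (?P<addr>\S+)"
)


def find_coordinators(lines):
    """Return (timestamp, member, address) for each self-elected coordinator."""
    found = []
    for line in lines:
        match = COORDINATOR_RE.search(line)
        if match:
            found.append((match.group("ts"), match.group("member"), match.group("addr")))
    return found


def is_split_brain(lines):
    """Hypothetical heuristic: more than one distinct member claimed coordinator."""
    members = {member for _, member, _ in find_coordinators(lines)}
    return len(members) > 1
```

Feeding it the combined locator1 and locator2 log lines from a single restart would flag the case shown here, where both locators report becoming coordinator within about half a second of each other.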


> Restarting 2 locators within 1s of each other causes potential locator split brain
> ----------------------------------------------------------------------------------
>
>                 Key: GEODE-3052
>                 URL: https://issues.apache.org/jira/browse/GEODE-3052
>             Project: Geode
>          Issue Type: Bug
>          Components: locator
>    Affects Versions: 1.1.1
>            Reporter: Udo Kohlmeyer
>            Assignee: Bruce Schuchardt
>             Fix For: 1.2.0
>
>
> Using the artifacts from GEODE-3003, it is possible to cause a locator split brain upon locator startup. This seems to happen only when the locators start within 1s of each other.


