Posted to issues@geode.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/02/01 17:11:00 UTC

[jira] [Commented] (GEODE-8901) Surviving side server forcefully disconnected after network drop

    [ https://issues.apache.org/jira/browse/GEODE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276483#comment-17276483 ] 

ASF GitHub Bot commented on GEODE-8901:
---------------------------------------

kamilla1201 opened a new pull request #5984:
URL: https://github.com/apache/geode/pull/5984


   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Once the PR is submitted, please check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Surviving side server forcefully disconnected after network drop
> ----------------------------------------------------------------
>
>                 Key: GEODE-8901
>                 URL: https://issues.apache.org/jira/browse/GEODE-8901
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.14.0
>            Reporter: Kamilla Aslami
>            Assignee: Kamilla Aslami
>            Priority: Major
>
> During a network partition, locator-0 and server-0 were partitioned from the other members of the distributed system (DS): locator-1, server-1, server-2 (leadMember), and server-3. In locator-0 we see the expected "Operation not permitted" exceptions for the 4 surviving-side members:
>  
> {code:java}
> [warn 2020/12/16 23:14:02.827 GMT <Geode Failure Detection thread 2> tid=0x78] Unable to send message to 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:02.938 GMT <Geode Heartbeat Sender> tid=0x22] Unable to send message to 10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:06.701 GMT <Geode Membership View Creator> tid=0x79] Unable to send message to 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:10.322 GMT <Geode Failure Detection thread 3> tid=0x7a] Unable to send message to 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
> java.io.IOException: Operation not permitted
> {code}
> As expected, we see the loss of quorum:
> {noformat}
> [warn 2020/12/16 23:14:11.718 GMT <Geode Membership View Creator> tid=0x79] total weight lost in this view change is 28 of 51.  Quorum has been lost!{noformat}
> However, we expected to see a lost weight of 38 (10 + 15 + 10 + 3 for server-1, server-2, server-3 and locator-1, respectively) rather than 28. What we do see is that server-3 gets forcefully disconnected as well. That might occur because, after the "Operation not permitted" exception above, server-3 passes an availability check.
> {noformat}
> [info 2020/12/16 23:14:10.323 GMT <Geode Failure Detection thread 3> tid=0x7a] Performing availability check for suspect member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 reason=Unable to send messages to this member via JGroups
> ...
> [warn 2020/12/16 23:14:11.711 GMT <Geode Membership View Creator> tid=0x79] these members failed to respond to the view change: [10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000, 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000, 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000, 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000]
> [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] checking state of member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
> [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 passed availability check{noformat}
> This issue looks similar to GEODE-8721, which was fixed in commit b7afc604b9c2fafe4388dcdcf05fc7ec49c0ce86, but the failure logs don't contain the log message relevant to GEODE-8721:
> {noformat}
> Availability check detected recent message traffic for suspect member{noformat}
> That log message includes a timestamp showing the time of contact. In GEODE-8721 we see the timestamp being continually updated.
>  
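
The weight arithmetic in the description above can be checked with a small standalone sketch. It does not use any Geode API. The weights for server-1, server-2, server-3 and locator-1 come from the issue text. The weights for locator-0 (3) and server-0 (10) are assumptions, chosen to follow the same pattern and to match the total of 51 reported in the log. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QuorumWeightSketch {

    // Member weights. The four surviving-side values are from the issue text;
    // locator-0 and server-0 are ASSUMED (same pattern, consistent with "of 51").
    static Map<String, Integer> weights() {
        Map<String, Integer> w = new LinkedHashMap<>();
        w.put("locator-0", 3);   // assumed
        w.put("server-0", 10);   // assumed
        w.put("locator-1", 3);
        w.put("server-1", 10);
        w.put("server-2", 15);   // lead member
        w.put("server-3", 10);
        return w;
    }

    // Sum of all member weights in the view.
    static int totalWeight(Map<String, Integer> w) {
        int sum = 0;
        for (int v : w.values()) {
            sum += v;
        }
        return sum;
    }

    // Sum of the weights of the members considered lost.
    static int lostWeight(Map<String, Integer> w, List<String> lost) {
        int sum = 0;
        for (String member : lost) {
            sum += w.get(member);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> w = weights();

        // All four surviving-side members counted as lost (the expected case).
        List<String> expectedLost =
            Arrays.asList("server-1", "server-2", "server-3", "locator-1");

        // server-3 left out, as if it were not counted as lost.
        List<String> withoutServer3 =
            Arrays.asList("server-1", "server-2", "locator-1");

        System.out.println("total weight: " + totalWeight(w));
        System.out.println("expected lost: " + lostWeight(w, expectedLost));
        System.out.println("lost without server-3: " + lostWeight(w, withoutServer3));
    }
}
```

Under these assumptions, the total is 51, the expected loss is 38, and leaving server-3 out gives 28, which matches the figure in the "Quorum has been lost!" log line.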



--
This message was sent by Atlassian Jira
(v8.3.4#803005)