Posted to issues@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2019/08/02 18:39:00 UTC

[jira] [Resolved] (GEODE-7031) Attempts to send messages to alert listeners delays network partition detection

     [ https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Schuchardt resolved GEODE-7031.
-------------------------------------
    Resolution: Fixed

> Attempts to send messages to alert listeners delays network partition detection
> -------------------------------------------------------------------------------
>
>                 Key: GEODE-7031
>                 URL: https://issues.apache.org/jira/browse/GEODE-7031
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS, network partition detection tests have failed to detect the partition in a reasonable amount of time. Logs show membership services attempting to send alerts to other processes that are no longer reachable. Each attempt takes 6 times the member-timeout setting, which is 30 seconds per attempt in these runs. It would be nice to have a different connection-formation timeout for this case, since alert notification is built into the logging system that membership services have to use. Because the alert system in turn depends on membership services functioning properly, this creates a circular dependency that has historically caused hangs and delays such as the one described here.
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> tid=0xc3] Sending (Alert "Unable to send message to 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> tid=0xc3] created PendingConnection org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> tid=0xc3] Giving up connecting to alert listener 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001{noformat}
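
The timeout arithmetic in the description can be sketched in Java as follows. This is only an illustration, not the change that resolved the issue: the class and method names are hypothetical, the 5000 ms member-timeout is inferred from the "6 * member-timeout = 30 seconds" statement and the 30-second gap in the log, and capping alert connections at one member-timeout is an assumed policy chosen for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch; none of these names come from the Geode code base.
final class AlertConnectTimeoutSketch {

    // Multiplier quoted in the issue: each attempt takes 6 * member-timeout.
    static final int CONNECT_TIMEOUT_MULTIPLIER = 6;

    // Connection-formation timeout as described in the issue.
    static int defaultConnectTimeoutMillis(int memberTimeoutMillis) {
        return CONNECT_TIMEOUT_MULTIPLIER * memberTimeoutMillis;
    }

    // Assumed policy for alert listeners: wait at most one member-timeout,
    // so a failed alert does not stall the failure detection thread for long.
    static int alertConnectTimeoutMillis(int memberTimeoutMillis) {
        return memberTimeoutMillis;
    }

    // Attempts a TCP connection with an explicit timeout and reports whether
    // it succeeded. SocketTimeoutException is a subclass of IOException.
    static boolean tryConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int memberTimeoutMillis = 5000; // inferred from the issue, see above
        System.out.println("default connect timeout (ms): "
                + defaultConnectTimeoutMillis(memberTimeoutMillis));
        System.out.println("alert connect timeout (ms):   "
                + alertConnectTimeoutMillis(memberTimeoutMillis));
    }
}
```

With a member-timeout of 5000 ms, the first method yields the 30-second wait seen between the 14:35:03 and 14:35:33 log entries, while the assumed alert-specific policy would give up after 5 seconds.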



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)