Posted to issues@geode.apache.org by "Anthony Baker (JIRA)" <ji...@apache.org> on 2018/03/08 16:35:00 UTC

[jira] [Commented] (GEODE-4802) Geode cluster hanged after network problems

    [ https://issues.apache.org/jira/browse/GEODE-4802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391489#comment-16391489 ] 

Anthony Baker commented on GEODE-4802:
--------------------------------------

TCP offers reliable, but not guaranteed, packet delivery.  If packets get lost along the way, they are retransmitted.

TCP will keep trying to deliver packets for a long time (about 15 minutes by default on Linux), based on the settings for {{tcp_retries1}} and {{tcp_retries2}}.  You would need to wait that long for TCP to *really* declare a packet as lost.  Also, suspect processing for a member in this state won't start for at least 15 seconds, depending on the settings of {{member-timeout}} and {{ack-severe-alert-threshold}}.
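
As a rough illustration of where the 15-minute figure comes from: on Linux the give-up time is governed by {{net.ipv4.tcp_retries2}} (default 15), and the retransmission timeout doubles after each attempt. The sketch below assumes Linux's usual bounds of 200 ms minimum and 120 s maximum for that timeout. The function name and parameters are made up for this example, and other operating systems (including Windows, which the reporter appears to use) behave differently.

```python
def tcp_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Approximate time before TCP abandons an unacknowledged segment.

    Assumes the retransmission timeout starts at rto_min, doubles after
    every unanswered attempt, and is capped at rto_max (Linux defaults).
    """
    total = 0.0
    rto = rto_min
    # One wait for the original transmission plus one per retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = tcp_give_up_seconds()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

With the defaults this gives roughly 924.6 seconds, which is far longer than the 15 seconds the reporter waited before stopping the packet drop.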

Suspect processing in Geode is used to fence off unresponsive members from the cluster.  That allows us to maintain consistency and predictable availability.  Cluster settings can be tuned to meet availability SLAs.


> Geode cluster hanged after network problems
> -------------------------------------------
>
>                 Key: GEODE-4802
>                 URL: https://issues.apache.org/jira/browse/GEODE-4802
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Eugene Nedzvetsky
>            Priority: Major
>         Attachments: clumsy2.jpg, threaddump.log
>
>
> Test preparation:
>  # create file bin/server1/gemfire.properties with property membership-port-range=2025-2030
>  # create file bin/server2/gemfire.properties with property membership-port-range=2035-2040
>  # Download network problems emulator [https://jagt.github.io/clumsy]
>  # Fill field 'filtering' in Clumsy: tcp and (tcp.DstPort == 2025 or tcp.DstPort == 2026 or tcp.DstPort == 2027 or tcp.DstPort == 2028 or tcp.DstPort == 2029 or tcp.DstPort == 2030). Select function 'Drop' and set Chance=100%. See clumsy2.jpg
> Steps to reproduce
>  # Start gfsh
>  # start locator --name=locator1
>  # start server --name=server1 --server-port=40411
>  # start server --name=server2 --server-port=40412
>  # create region --name=regionA --type=REPLICATE
>  # put --region=regionA --key="1" --value="one"
>  # Click on 'start' button in Clumsy
>  # put --region=regionA --key="1" --value="onev2"
>  # Wait *15s* and click on 'stop' in Clumsy
> Gfsh console has hung.
> bin\server1\server1.log:
> [warning 2018/03/07 18:02:50.360 PST server1 <Function Execution Processor1> tid=0x4b] 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 22 waiting for 1 replies from [192.168.100.109(server2:12804)<v2>:2035]> on 192.168.100.109(server1:14416)<v1>:2045 whose current membership list is: [[192.168.100.109(server2:12804)<v2>:2035, 192.168.100.109(locator1:15628:locator)<ec><v0>:1024, 192.168.100.109(server1:14416)<v1>:2045]]
> Pulse shows 'normal' status for both servers.
> Gfsh works again once the server1 process is killed.
> Also, I've reproduced another issue with the same scenario in my test environment (see [^threaddump.log])
>  


