Posted to issues@geode.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/09/16 23:02:00 UTC

[jira] [Commented] (GEODE-8385) hang recovering from disk with cyclic dependencies

    [ https://issues.apache.org/jira/browse/GEODE-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197271#comment-17197271 ] 

ASF subversion and git services commented on GEODE-8385:
--------------------------------------------------------

Commit d49fe266cdbcf01b083944141fe3e667824d27dc in geode's branch refs/heads/support/9.10 from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=d49fe26 ]

GEODE-8385: hang recovering from disk with cyclic dependencies (#5403)

* GEODE-8385: hang recovering from disk with cyclic dependencies

This restores the point at which membership listeners are notified of
departures.  In 1.12 and earlier we did this when a ShutdownMessage was
received instead of waiting for a new membership view announcing the
departure.  Membership views can take some time to form and install,
which can cause persistent (disk store) views to be updated later than
they used to be.

In the case of this ticket the disk store of one member was being
closed while another was shutting down.  The member closing its disk
store did not see the view announcing that shutdown until most of its
disk store regions had closed their persistence advisors.  This left the
disk store thinking that the other member was still up at the time it
was closed.
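
The ordering problem can be illustrated with a toy model (hypothetical names and structure for illustration only, not Geode's actual API).  The only difference between the two behaviors is whether the departing member is removed from the closing member's online set when the ShutdownMessage arrives, or only later when the new membership view is installed:

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model (hypothetical names, not Geode's API) of the notification point. */
public class NotificationOrdering {

    /** Returns the members the closing disk store still believes are online. */
    static Set<String> closeDiskStore(boolean notifyOnShutdownMessage) {
        Set<String> onlineMembers = new HashSet<>(Set.of("A", "B"));
        // Member B sends a ShutdownMessage...
        if (notifyOnShutdownMessage) {
            // 1.12 / post-fix behavior: listeners are told immediately,
            // so the departure reaches the persistence layer in time
            onlineMembers.remove("B");
        }
        // ...but the new membership view announcing B's departure has NOT
        // been installed yet, and A closes its disk store right now.
        return onlineMembers;
    }

    public static void main(String[] args) {
        // Deferred (view-based) notification: B still looks online at close
        // time, so B is recorded as a member A must wait for on recovery.
        System.out.println("deferred: " + closeDiskStore(false));
        // Eager (ShutdownMessage-based) notification: B already departed.
        System.out.println("eager:    " + closeDiskStore(true));
    }
}
```

With the deferred notification the persistent view written at close time still lists the departed member, which is exactly the stale state that produces the recovery dependency on restart.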

(cherry picked from commit 08316aa05198704d96aefc5497e483052c27a378)


> hang recovering from disk with cyclic dependencies
> --------------------------------------------------
>
>                 Key: GEODE-8385
>                 URL: https://issues.apache.org/jira/browse/GEODE-8385
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, persistence
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: no-release-note, pull-request-available
>             Fix For: 1.13.0, 1.14.0
>
>
> In a test cluster using replicated persistent Regions all of the servers were shut down and restarted.  The restart hung showing a cycle in disk store dependencies.
>  {noformat}
> [info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has potentially stale data. It is waiting for another online member to recover the latest data.My persistent id:
>   DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67  Name: persistgemfire4_host1_4194  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
> Members with potentially new data:[  
> DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f  Name: persistgemfire10_host1_4208  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1]
> Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> {noformat}
> After looking at the logs for all members, the "members with potentially new data" for each member were found to be:
> {noformat}
> Member | Members with potentially new data
> -------+----------------------------------
>      1 | all
>      2 | 4
>      3 | 4
>      4 | 10
>      5 | 2, 3, 4, 8, 10
>      6 | 2, 3, 4, 5, 7, 8, 10
>      7 | 3, 4, 10
>      8 | 3, 4, 10
>      9 | 2, 3, 4, 5, 7, 8, 10
>     10 | 3
> {noformat}
> It appears that there is a cycle in this "waiting for another online member" graph: 3 -> 4 -> 10 -> 3.
> The problem seems to have cropped up after the fix for GEODE-7196 was merged.  That fix changed the timing of member-departed notifications such that a server might close a Region's Persistence Advisor before receiving notification that another server was shutting down.  We used to perform this notification upon receipt of a ShutdownMessage, but now we only perform it when the membership view has changed.
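
The cycle the reporter spotted by hand can also be found mechanically.  The sketch below (illustrative only; `findCycle` and `ticketGraph` are hypothetical helpers, not Geode code) runs a depth-first search with a gray/black coloring over the "waiting for" table above and extracts the first cycle it hits:

```java
import java.util.*;

public class WaitGraphCycle {

    /** Finds one cycle in a directed graph via DFS, or returns an empty list. */
    static List<Integer> findCycle(Map<Integer, List<Integer>> graph) {
        Map<Integer, Integer> color = new HashMap<>(); // 0 white, 1 gray, 2 black
        List<Integer> path = new ArrayList<>();
        for (Integer start : graph.keySet()) {
            List<Integer> cycle = dfs(start, graph, color, path);
            if (cycle != null) return cycle;
        }
        return List.of();
    }

    private static List<Integer> dfs(int node, Map<Integer, List<Integer>> graph,
                                     Map<Integer, Integer> color, List<Integer> path) {
        int c = color.getOrDefault(node, 0);
        if (c == 1) {
            // Back edge to a node still on the DFS path: the cycle is the
            // suffix of the path starting at that node.
            return new ArrayList<>(path.subList(path.indexOf(node), path.size()));
        }
        if (c == 2) return null; // already fully explored
        color.put(node, 1);
        path.add(node);
        for (int next : graph.getOrDefault(node, List.of())) {
            List<Integer> cycle = dfs(next, graph, color, path);
            if (cycle != null) return cycle;
        }
        path.remove(path.size() - 1);
        color.put(node, 2);
        return null;
    }

    /** The "members with potentially new data" table from the ticket. */
    static Map<Integer, List<Integer>> ticketGraph() {
        Map<Integer, List<Integer>> waits = new TreeMap<>();
        waits.put(1, List.of(2, 3, 4, 5, 6, 7, 8, 9, 10)); // "all"
        waits.put(2, List.of(4));
        waits.put(3, List.of(4));
        waits.put(4, List.of(10));
        waits.put(5, List.of(2, 3, 4, 8, 10));
        waits.put(6, List.of(2, 3, 4, 5, 7, 8, 10));
        waits.put(7, List.of(3, 4, 10));
        waits.put(8, List.of(3, 4, 10));
        waits.put(9, List.of(2, 3, 4, 5, 7, 8, 10));
        waits.put(10, List.of(3));
        return waits;
    }

    public static void main(String[] args) {
        // Reports [4, 10, 3], i.e. the 3 -> 4 -> 10 -> 3 cycle from the ticket.
        System.out.println("cycle: " + findCycle(ticketGraph()));
    }
}
```

Because every member in that cycle is waiting on another member in the same cycle to come online with newer data, no member can finish recovery, which matches the observed hang.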



--
This message was sent by Atlassian Jira
(v8.3.4#803005)