Posted to issues@geode.apache.org by "Soubhik Chakraborty (JIRA)" <ji...@apache.org> on 2016/03/14 11:33:33 UTC

[jira] [Created] (GEODE-1088) shutdown-all should skip member dependency checks when restarted

Soubhik Chakraborty created GEODE-1088:
------------------------------------------

             Summary: shutdown-all should skip member dependency checks when restarted
                 Key: GEODE-1088
                 URL: https://issues.apache.org/jira/browse/GEODE-1088
             Project: Geode
          Issue Type: Improvement
          Components: management
            Reporter: Soubhik Chakraborty


Right now, when a Geode cluster is started, each member waits for other members to start (for persistent regions only). These members are recorded when this member is stopped, either via an individual stop or as part of shutdown-all.

{code}shutdown-all{code} indicates that the entire cluster is going down, so if incoming traffic is stopped first, all cluster members can be guaranteed to be in a consistent state while the cluster is stopped. Therefore, members stopped cleanly using shutdown-all can skip member dependency checks while starting up.
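
The startup decision described above could be sketched roughly as follows. This is only an illustration under stated assumptions: LastStopKind, StartupCheckSketch and mustWaitForRecordedMembers are hypothetical names and are not part of Geode, and how the last stop is recorded and read back is left open.

```java
// Hypothetical sketch: these names are illustrative and are not Geode APIs.
enum LastStopKind {
    INDIVIDUAL_STOP,     // the member was stopped on its own
    CLEAN_SHUTDOWN_ALL,  // the member went down as part of a clean shutdown-all
    UNKNOWN              // crash, kill -9, or no record found
}

final class StartupCheckSketch {
    private StartupCheckSketch() {}

    // Decides whether a restarting member should wait for the members it
    // recorded when it was stopped.
    static boolean mustWaitForRecordedMembers(LastStopKind lastStop,
                                              boolean hasPersistentRegions) {
        // The wait only applies to persistent regions.
        if (!hasPersistentRegions) {
            return false;
        }
        // A clean shutdown-all means every member stopped in a consistent
        // state, so the dependency check can be skipped.
        return lastStop != LastStopKind.CLEAN_SHUTDOWN_ALL;
    }
}
```

Treating an unknown or missing record the same as an individual stop keeps the current (waiting) behaviour as the default.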

A more detailed proposal is listed in the following ticket:
https://snappydata.atlassian.net/browse/SNAP-586

I need the team's help (esp. [~upthewaterspout], [~bschuchardt]) in sharing any insights or pitfalls they see in the proposal. The proposed sequence of steps is listed here for reference.


There are 2 main cases we need to tackle.

# make shutdown-all two-phase (assuming all members are healthy)
  #* Phase-I : stop network interfaces of all servers (via p2p messaging)
  #* wait for in-flight operations to complete, viz.
    #*# ongoing commits ? (note: due to n/w stop user will already see failure)
    #*# restrict new commits (n/w stopped already, so new commits won't arrive)
    #*# rollback existing transactions (as new commit/rollback won't come from user)
    #*# introduce an op counter and monitor it for zero on each member for non-tx operations (distribution stats counter can be used ?)
    #*# invoke disk sync procedure ?
  #* Phase-II : trigger shutdown on each of the VMs (via p2p messaging)
    #** right now, during shutdown-all, there is a lot of chatter at the JGroups level as members suspect each other. Should we try to avoid it?
  #* skip member dependency check during restart by reading a recorded entry somewhere (data dictionary ?)
# if one or more members are unreachable (hung members), the only remaining way is to shut down via script.
  #* Need to think more on how to recognize hung members and what should be done before "kill -9", such as recording the list of those members.
  #* these recorded members should be started last, after starting all the members that shut down cleanly.
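
The two cases above can be sketched in-process as follows. This is a minimal sketch under assumptions, not a proposed implementation: MemberSketch, TwoPhaseShutdownSketch and all of their methods are hypothetical names, plain method calls stand in for the p2p messages, and a simple polling loop stands in for monitoring the op counter. Transaction rollback and disk sync are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these types are Geode APIs.
final class MemberSketch {
    final String name;
    private final AtomicInteger inFlightOps = new AtomicInteger();
    private volatile boolean acceptingTraffic = true;
    private volatile boolean cleanShutdownRecorded = false;

    MemberSketch(String name) { this.name = name; }

    // Returns false once Phase-I has closed the member to new operations.
    boolean beginOp() {
        inFlightOps.incrementAndGet();
        if (!acceptingTraffic) {
            inFlightOps.decrementAndGet();
            return false;
        }
        return true;
    }

    void endOp() { inFlightOps.decrementAndGet(); }

    // Phase-I: stop accepting new operations.
    void stopTraffic() { acceptingTraffic = false; }

    // Polls the op counter until it reaches zero or the timeout expires.
    boolean awaitDrained(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (inFlightOps.get() > 0) {
            if (System.nanoTime() - deadline >= 0) return false;
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // Phase-II: shut down, recording whether the shutdown was clean.
    void shutdown(boolean clean) { cleanShutdownRecorded = clean; }

    boolean wasCleanShutdown() { return cleanShutdownRecorded; }
}

final class TwoPhaseShutdownSketch {
    private TwoPhaseShutdownSketch() {}

    // Returns the names of members that did not drain in time.
    static List<String> shutdownAll(List<MemberSketch> members, long drainTimeoutMillis) {
        for (MemberSketch m : members) m.stopTraffic();           // Phase-I
        List<String> notDrained = new ArrayList<>();
        for (MemberSketch m : members) {
            if (!m.awaitDrained(drainTimeoutMillis)) notDrained.add(m.name);
        }
        boolean allClean = notDrained.isEmpty();
        for (MemberSketch m : members) m.shutdown(allClean);      // Phase-II
        return notDrained;
    }

    // Case 2: members recorded as hung are started last, after the clean ones.
    static List<String> restartOrder(List<String> allMembers, Set<String> recordedAsHung) {
        List<String> order = new ArrayList<>();
        for (String m : allMembers) if (!recordedAsHung.contains(m)) order.add(m);
        for (String m : allMembers) if (recordedAsHung.contains(m)) order.add(m);
        return order;
    }
}
```

In this sketch the clean-shutdown marker is only recorded when every member drained, so a single stuck member makes the whole cluster fall back to the normal dependency checks on restart. That is one possible policy among several.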





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)