Posted to dev@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2017/05/15 19:57:04 UTC

[jira] [Resolved] (GEODE-2865) data loss in initial-image replication with multicast

     [ https://issues.apache.org/jira/browse/GEODE-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Schuchardt resolved GEODE-2865.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0

> data loss in initial-image replication with multicast
> -----------------------------------------------------
>
>                 Key: GEODE-2865
>                 URL: https://issues.apache.org/jira/browse/GEODE-2865
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Bruce Schuchardt
>             Fix For: 1.2.0
>
>
> During initial image replication ("get initial image") a state-flush operation is performed to ensure that all in-flight operations are applied to the region being replicated before replication starts.  If multicast is enabled for a region, the state-flush can currently miss one or more in-flight operations, so that the new replicate is missing changes that are reflected in the region being replicated.
> For example, process A sends a multicast put() replication message to process B.  Simultaneously process C is replicating the affected region and performs a state-flush.  Process A sends a state-stabilization message to process B noting its multicast channel state (NAKACK2 outgoing message counter).  Process B receives this and waits for the multicast channel state to show that it has received all of the messages.  Process B then sends a state-stabilized message to process C (the new replicate).
> The state-stabilization algorithm is faulty in this case because it runs in the waiting-thread pool.  The algorithm assumes that it is executing in the serial-executor thread pool, so that any messages that happened before it have already been applied to the region.  As a result, messages can be received and scheduled for the serial-executor but not yet applied to the region when replication begins.
> The membership manager should be modified to ensure that the serial-executor queue has been flushed before giving the state-flush operation the go-ahead to begin replication.
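
The proposed fix can be illustrated with a minimal Java sketch. This is not Geode's actual code: the class SerialQueueFlushSketch, its methods, and the map standing in for a region are all hypothetical. It assumes the serial executor behaves like a single-threaded FIFO executor, so a no-op marker task submitted after the in-flight operations completes only once everything queued ahead of it has been applied.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Geode's implementation.
final class SerialQueueFlushSketch {
    // Stand-in for the serial executor: one thread, tasks run in FIFO order.
    private final ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    // Stand-in for the region being replicated.
    private final Map<String, String> region = new ConcurrentHashMap<>();

    // A received replication message is scheduled, not applied immediately.
    void schedulePut(String key, String value) {
        serialExecutor.execute(() -> region.put(key, value));
    }

    // Flush: enqueue a no-op marker and wait for it. Because the executor is
    // serial, the marker finishes only after every earlier task has run.
    void flushSerialQueue(long timeoutMs) throws Exception {
        serialExecutor.submit(() -> { }).get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    int regionSize() {
        return region.size();
    }

    String regionGet(String key) {
        return region.get(key);
    }

    void shutdown() {
        serialExecutor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SerialQueueFlushSketch sketch = new SerialQueueFlushSketch();
        for (int i = 0; i < 100; i++) {
            sketch.schedulePut("key" + i, "value" + i);
        }
        // Only after the flush returns is it safe to report "state stabilized".
        sketch.flushSerialQueue(5000);
        System.out.println("entries applied before go-ahead: " + sketch.regionSize());
        sketch.shutdown();
    }
}
```

Waiting on a marker task in the same queue avoids the race described above, where a check made from a different thread pool can observe that messages were received without knowing whether they were applied.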



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)