Posted to dev@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2017/05/02 22:04:04 UTC

[jira] [Created] (GEODE-2865) data loss in initial-image replication with multicast

Bruce Schuchardt created GEODE-2865:
---------------------------------------

Summary: data loss in initial-image replication with multicast
Key: GEODE-2865
URL: https://issues.apache.org/jira/browse/GEODE-2865
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Bruce Schuchardt

During initial image replication ("get initial image") a state-flush operation is performed to ensure that all in-flight operations are applied to the region being replicated before replication starts. If multicast is enabled for a region, it is currently possible for the state-flush to miss one or more in-flight operations, so that the new replicate is missing changes that are reflected in the region being replicated.

For example, process A sends a multicast put() replication message to process B. Simultaneously process C is replicating the affected region and performs a state-flush. Process A sends a state-stabilization message to process B noting its multicast channel state (NAKACK2 outgoing message counter). Process B receives this and waits for the multicast channel state to show that it has received all of the messages. Process B then sends a state-stabilized message to process C (the new replicate).

The state-stabilization algorithm is faulty in this case because it runs in the waiting-thread pool. The algorithm assumes that it is executing in the serial-executor thread pool, where any messages that arrived before it would already have been applied to the region. As a result, messages can have been received and scheduled for the serial-executor but not yet applied to the region when replication begins.

The membership manager should be modified to ensure that the serial-executor queue has been flushed before giving the state-flush operation the go-ahead to begin replication.
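
To illustrate the idea of flushing the serial-executor queue, here is a minimal Java sketch. It is not Geode code: the class SerialFlushSketch, the methods flushSerialExecutor and putVisibleAfterFlush, and the map standing in for a region are all hypothetical, and the serial-executor is assumed to behave like a single-threaded FIFO executor. Under that assumption, waiting for a no-op marker task to complete guarantees that every task queued before it has already run.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names are Geode APIs.
final class SerialFlushSketch {

  /** Blocks until every task queued on the serial executor before this call has run. */
  static void flushSerialExecutor(ExecutorService serialExecutor, long timeout, TimeUnit unit)
      throws Exception {
    // A single-threaded executor runs tasks in FIFO order, so once this no-op
    // marker completes, everything submitted ahead of it has completed too.
    serialExecutor.submit(() -> { }).get(timeout, unit);
  }

  /** Simulates a scheduled-but-unapplied put and reports whether it is visible after the flush. */
  static boolean putVisibleAfterFlush() throws Exception {
    Map<String, String> region = new ConcurrentHashMap<>(); // stand-in for a region
    ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    try {
      // The put has been received and scheduled, but is slow to be applied.
      serialExecutor.execute(() -> {
        try {
          Thread.sleep(200);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        region.put("key", "value");
      });

      // Flush the serial queue before giving the go-ahead for replication.
      flushSerialExecutor(serialExecutor, 5, TimeUnit.SECONDS);
      return "value".equals(region.get("key"));
    } finally {
      serialExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("put visible after flush: " + putVisibleAfterFlush());
  }
}
```

A check made from a different thread pool, without the marker task, could run while the put is still waiting in the queue, which is the situation described above.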

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)