Posted to solr-user@lucene.apache.org by Grzegorz Huber <gr...@gmail.com> on 2016/09/02 09:53:47 UTC

SOLR replication: different behavior for network cut off vs. machine restart

Hi,

We are trying to set up a SolrCloud environment with 1 shard and 2
replicas (one of them the leader), coordinated by a 3-node ZooKeeper
ensemble.
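
For reference, the collection is created through the Collections API
roughly like the sketch below (the host, collection name and configset
name here are placeholders, not our actual values):

import urllib.parse
import urllib.request

# Placeholder values -- not our real host/collection/configset names.
SOLR_HOST = "http://solr-node1:8983"

params = urllib.parse.urlencode({
    "action": "CREATE",
    "name": "testcollection",
    "numShards": 1,
    "replicationFactor": 2,
    "collection.configName": "testconfig",
})
url = SOLR_HOST + "/solr/admin/collections?" + params
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))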

The setup works fine during normal operation: data is replicated
between the replicas at runtime.
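
To double-check that, I query each replica directly with distrib=false
so that only the local index is counted. A minimal sketch of that check
(host names and collection name are placeholders):

import json
import urllib.request

COLLECTION = "testcollection"        # placeholder
REPLICA_HOSTS = [                    # placeholder base URLs of the two replicas
    "http://solr-node1:8983",
    "http://solr-node2:8983",
]

# distrib=false makes each node answer from its own local index only,
# so the counts show whether both replicas really hold the data.
for host in REPLICA_HOSTS:
    url = host + "/solr/" + COLLECTION + "/select?q=*:*&rows=0&distrib=false&wt=json"
    with urllib.request.urlopen(url) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    print(host, "numFound =", body["response"]["numFound"])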

Now we try to simulate failure behavior in two scenarios:

CASE 1: shutting down one of the replica machines (tried with both the leader and the non-leader)
CASE 2: cutting off the network so that the non-leader replica goes down

In both cases data is being written continuously to the SolrCloud cluster.
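
The continuous write load is essentially a small loop like this (field
names, commit strategy and timing are simplified for illustration):

import json
import time
import urllib.request

# Placeholder URL -- any node of the cluster, our real collection name differs.
UPDATE_URL = "http://solr-node1:8983/solr/testcollection/update?commit=true"

# Keep indexing small documents while a replica is shut down or cut off the network.
doc_id = 0
while True:
    doc = {"id": str(doc_id), "title_s": "doc " + str(doc_id)}
    req = urllib.request.Request(
        UPDATE_URL,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req)
    except Exception as exc:
        # Writes can fail transiently while the cluster reorganises; note it and keep going.
        print("index error:", exc)
    doc_id += 1
    time.sleep(1)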

CASE 1: The replication process starts after the failed machine is
booted up again, and the complete data set ends up present on both
replicas. Everything works fine.

CASE 2: Once reconnected to the network, the non-leader replica starts
the recovery process, but for some reason the new data from the leader
is not replicated onto the previously disconnected replica.
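
While the replica is recovering I watch its state with the CLUSTERSTATUS
action of the Collections API, roughly like this (placeholders again),
to see whether it is reported as down, recovering or active:

import json
import urllib.request

SOLR_HOST = "http://solr-node1:8983"   # placeholder
COLLECTION = "testcollection"          # placeholder

url = (SOLR_HOST + "/solr/admin/collections?action=CLUSTERSTATUS"
       "&collection=" + COLLECTION + "&wt=json")
with urllib.request.urlopen(url) as resp:
    status = json.loads(resp.read().decode("utf-8"))

# Print the state of every replica (e.g. active / recovering / down).
shards = status["cluster"]["collections"][COLLECTION]["shards"]
for shard_name, shard in shards.items():
    for replica_name, replica in shard["replicas"].items():
        leader = " (leader)" if replica.get("leader") == "true" else ""
        print(shard_name, replica_name, replica["state"] + leader)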

Comparing the logs for the two cases, I don't understand why in CASE 2
Solr logs

RecoveryStrategy ###### currentVersions as populated, and
RecoveryStrategy ###### startupVersions=[[]] as empty,

whereas in CASE 1 startupVersions is filled with the same objects that
show up in currentVersions in CASE 2.
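
(For reference, this is roughly how I pull those lines out of solr.log
on each node for comparison; the log path is just an example, use
whatever your installation writes to.)

# Example only -- the log path depends on the installation.
LOG_PATH = "/var/solr/logs/solr.log"

with open(LOG_PATH, encoding="utf-8") as log:
    for line in log:
        # Keep just the RecoveryStrategy lines that mention the version lists.
        if "RecoveryStrategy" in line and (
                "startupVersions" in line or "currentVersions" in line):
            print(line.rstrip())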

The general question is: why does restarting Solr result in a
successful recovery and replication, while reconnecting the network
does not?

Thanks for any tips / leads!

Cheers,
Greg