Posted to dev@lucene.apache.org by "Pushkar Raste (JIRA)" <ji...@apache.org> on 2016/08/26 15:50:20 UTC
[jira] [Commented] (SOLR-9446) Just replicated index goes into
replication recovery on leader failure even though the index was not changed
[ https://issues.apache.org/jira/browse/SOLR-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15439296#comment-15439296 ]
Pushkar Raste commented on SOLR-9446:
-------------------------------------
I can think of a few ways to solve this using fingerprint comparison:
# Add a fingerprint check in {{SyncStrategy.syncToMe()}} and request a replica to sync only if the fingerprints do not match
# Add a fingerprint check in {{RecoveryStrategy.doRecovery()}} and initiate recovery only if the fingerprints do not match
# Add a fingerprint check in {{PeerSync.sync()}} to check whether we are already in sync
I think we almost always try PeerSync before trying replication, so *#3* should work.
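A minimal sketch of option *#3*, assuming a simplified fingerprint made of a max version, an order-independent hash of all versions, and a doc count. The {{Fingerprint}} class below is a standalone stand-in for illustration only, not Solr's real {{IndexFingerprint}} API, and {{PeerSyncSketch.sync()}} is a hypothetical reduction of {{PeerSync.sync()}} to the short-circuit decision:

```java
// Hypothetical stand-in for an index fingerprint: a cheap-to-compare
// summary of the versions present in a replica's index.
final class Fingerprint {
    final long maxVersionEncountered; // highest version seen in the index
    final long versionsHash;          // order-independent hash of all versions
    final long numDocs;               // document count, as a sanity check

    Fingerprint(long maxVersionEncountered, long versionsHash, long numDocs) {
        this.maxVersionEncountered = maxVersionEncountered;
        this.versionsHash = versionsHash;
        this.numDocs = numDocs;
    }

    boolean matches(Fingerprint other) {
        return maxVersionEncountered == other.maxVersionEncountered
            && versionsHash == other.versionsHash
            && numDocs == other.numDocs;
    }
}

public class PeerSyncSketch {
    // Option #3: before exchanging tlog version lists, compare fingerprints.
    // If they match, the replica is already in sync and PeerSync can report
    // success immediately -- no tlog frame of reference is needed, so a
    // freshly replicated node without tlogs avoids replication recovery.
    static boolean sync(Fingerprint leader, Fingerprint replica) {
        if (leader.matches(replica)) {
            return true; // already in sync: skip the version exchange entirely
        }
        // ... otherwise fall through to the normal tlog-based version
        // comparison, and to replication recovery if that fails ...
        return false;
    }

    public static void main(String[] args) {
        Fingerprint leader       = new Fingerprint(1700000000L, 0x5eedL, 42);
        Fingerprint freshReplica = new Fingerprint(1700000000L, 0x5eedL, 42);
        Fingerprint staleReplica = new Fingerprint(1600000000L, 0xdeadL, 40);
        System.out.println(sync(leader, freshReplica)); // true: short-circuit
        System.out.println(sync(leader, staleReplica)); // false: real sync needed
    }
}
```

The point of the sketch is that the check is a pure comparison of small summaries, so it is safe to run even on nodes with empty tlogs, which is exactly the case described in this issue.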
> Just replicated index goes into replication recovery on leader failure even though the index was not changed
> -------------------------------------------------------------------------------------------------------
>
> Key: SOLR-9446
> URL: https://issues.apache.org/jira/browse/SOLR-9446
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: replication (java)
> Reporter: Pushkar Raste
> Priority: Minor
>
> We noticed this issue while migrating a Solr index from machines {{A1, A2 and A3}} to {{B1, B2, B3}}. We followed these steps (there were no updates during the migration process):
> * Index had replicas on machines {{A1, A2, A3}}. Let's say {{A1}} was the leader at the time
> * We added 3 more replicas: {{B1, B2 and B3}}. These nodes synced with the leader via replication; being fresh nodes, they had no tlogs.
> * We shut down one of the old nodes ({{A3}}).
> * We then shut down the leader ({{A1}})
> * A new leader (let's say {{A2}}) got elected
> * Leader asked all the replicas to sync with it
> * The fresh nodes (the ones without tlogs) first tried PeerSync, but since they had no frame of reference, PeerSync failed and they fell back to replication
> Although replication would not copy all the segments again, it seems like we could short-circuit the sync to put nodes back into the active state as soon as possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)