Posted to issues@lucene.apache.org by "Erick Erickson (Jira)" <ji...@apache.org> on 2020/07/24 11:24:00 UTC

[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)

    [ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164379#comment-17164379 ] 

Erick Erickson commented on SOLR-13486:
---------------------------------------

This is at least in the same ballpark; whether it's the same root cause is TBD.

> race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestTlogReplayVsRecovery)
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-13486__test.patch, apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> There is a bug in solr cloud that can result in replicas being out of sync with the leader if:
>  * The leader has uncommitted docs (in the tlog) that didn't make it to the replica
>  * The leader restarts
>  * The replica begins to peer sync from the leader before the leader finishes its own tlog replay on startup
> A "rolling restart" is the situation in which this is most likely to affect real-world users.
> This was first discovered via hard to reproduce TestCloudConsistency failures in jenkins, but that test has since been modified to work around this bug, and a new test "TestTlogReplayVsRecovery" has been added that more aggressively demonstrates this error.
> Original jira description below...
> ----
> I've been investigating some jenkins failures from TestCloudConsistency, which at first glance suggest a problem w/replica(s) recovering after a network partition from the leader - but in digging into the logs the root cause actually seems to be a thread race condition when a replica (the leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically run in a background thread (via the {{ZkContainer.coreZkRegister}} ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to {{ZKController.register(...)}} which is ultimately responsible for checking if there are any "old" tlogs on disk, and if so handling the "Replaying tlog for <URL> during startup" logic
>  * Because this happens in a background thread, other logic/requests can be handled by this core/replica in the meantime - before it starts (or while in the middle of) replaying the tlogs
>  ** Notably: *leaders that have not yet replayed tlogs on startup will erroneously respond to RTG / Fingerprint / PeerSync requests from other replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog ... during startup" logic ... particularly since this can happen even for replicas that are/become leaders. It seems like this could potentially cause a whole host of problems, only one of which manifests in this particular test failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX (either of its own volition, or because it was instructed to by leaderX) in an attempt to recover
>  *** the responses to these recovery requests will not include updates in the tlog files that existed on leaderX prior to startup that have not yet been replayed
>  * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) updates the other replicas know nothing about
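The interleaving described above can be sketched in plain Java. This is a hypothetical, simplified model (not Solr code): all names here are invented for illustration, latches stand in for the real thread timing, and an integer counter stands in for the index. It forces the bad ordering deterministically so the replica's stale view is visible.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reported race: the "leader" marks itself
// ACTIVE and answers a sync request while tlog replay is still running
// in a background thread (analogous to the coreZkRegister thread).
public class TlogReplayRaceSketch {
    // Docs already visible in the index vs. uncommitted docs in the tlog.
    static final AtomicInteger committedDocs = new AtomicInteger(10);
    static final int tlogDocs = 5;
    static volatile boolean active = false;

    public static void main(String[] args) throws Exception {
        CountDownLatch replicaSynced = new CountDownLatch(1);

        // Background registration thread, as in ZkContainer.registerInZk(...).
        Thread register = new Thread(() -> {
            active = true;                 // BUG modeled: ACTIVE before replay
            try {
                replicaSynced.await();     // hold replay until replicaY "recovers"
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            committedDocs.addAndGet(tlogDocs); // "Replaying tlog ... during startup"
        });
        register.start();

        while (!active) { Thread.onSpinWait(); }
        int replicaView = committedDocs.get(); // replicaY's sync request sees 10 docs
        replicaSynced.countDown();             // replay finishes only afterwards
        register.join();

        // Replica believes it is in sync but is missing the replayed docs.
        System.out.println("leader=" + committedDocs.get() + " replica=" + replicaView);
    }
}
```

Run as written, the sketch prints `leader=15 replica=10`: the replica "recovered" against a leader that had not yet replayed its tlog, which is exactly the out-of-sync end state the test demonstrates.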



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org