Posted to issues@lucene.apache.org by "Erick Erickson (Jira)" <ji...@apache.org> on 2019/12/30 14:05:00 UTC

[jira] [Comment Edited] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)

    [ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005322#comment-17005322 ] 

Erick Erickson edited comment on SOLR-13486 at 12/30/19 2:04 PM:
-----------------------------------------------------------------

[~ab] Yes, I do get that error on some failed tests. I notice that this can be logged in two places, and the second one (AutoScalingSnitch.getRemoteInfo) throws an explicit exception. My guess is that it should continue to throw a SERVER_ERROR exception, correct?
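For reference, something like the following is roughly what I have in mind. This is only a minimal sketch, not the actual AutoScalingSnitch.getRemoteInfo implementation; getRemoteInfo0 and solrNode are made-up names:
{code}
// Minimal sketch only, NOT the actual AutoScalingSnitch code. getRemoteInfo0() is a
// made-up stand-in for the call whose failure is currently only logged.
// Uses org.apache.solr.common.SolrException.
try {
  return getRemoteInfo0(solrNode);
} catch (Exception e) {
  log.error("Unable to get remote info for {}", solrNode, e);
  // keep surfacing the failure to the caller as a SERVER_ERROR
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
      "Unable to get remote info for " + solrNode, e);
}
{code}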

[~dweiss] I'm seeing two distinct errors now, but neither one of them is exactly yours. One is coming from assertDocExists like yours, but the root cause seems to be "Server refused connection". Do you see that in your stack traces?

The other one I'm getting is a timeout waiting for the collection to become active.

I'm getting about 130 of these, or about 65 retries for each replica trying to recover:
{code}
109782 ERROR (recoveryExecutor-173-thread-1-processing-n:127.0.0.1:64320_solr x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node4) [n:127.0.0.1:64320_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node4 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy Failed to connect leader http://127.0.0.1:64314/solr on recovery, try again
{code}

This comes from RecoveryStrategy, which pauses for half a second and then rebuilds the connection before retrying, so I'm a bit mystified. I wonder if this has anything to do with the proxy in the test...
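For context, the retry pattern I'm describing looks roughly like the sketch below. This is only my reading of the loop, not the actual RecoveryStrategy code; pingLeader(), isClosed() and the logger are made-up stand-ins:
{code}
// Sketch of the retry pattern described above, NOT the actual RecoveryStrategy code.
// pingLeader() and isClosed() are hypothetical stand-ins for the real checks.
// Uses org.apache.solr.client.solrj.impl.HttpSolrClient.
void retryUntilLeaderReachable(String leaderUrl) {
  boolean connected = false;
  while (!connected && !isClosed()) {
    // rebuild the connection on every attempt
    try (HttpSolrClient client = new HttpSolrClient.Builder(leaderUrl).build()) {
      pingLeader(client);        // throws if the leader refuses the connection
      connected = true;
    } catch (Exception e) {
      log.error("Failed to connect leader {} on recovery, try again", leaderUrl);
      try {
        Thread.sleep(500);       // pause half a second before the next attempt
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
{code}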




was (Author: erickerickson):
[~ab] Yes, I do get that error on some failed tests. I notice that this can be logged in two places, and the second one (AutoScalingSnitch.getRemoteInfo) throws an explicit exception. My guess is that it should continue to throw a SERVER_ERROR exception, correct?

[~dweiss] I'm seeing two distinct errors now, but neither one of them is exactly yours. One is coming from assertDocExists like yours, but the root cause seems to be "Server refused connection". Do you see that in your stack traces?

The other one I'm getting is a timeout waiting for the collection to become active.

Digging more...

> race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> I've been investigating some Jenkins failures from TestCloudConsistency, which at first glance suggest a problem w/replica(s) recovering after a network partition from the leader - but in digging into the logs the root cause actually seems to be a thread race condition when a replica (the leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically run in a background thread (via the {{ZkContainer.coreZkRegister}} ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to {{ZKController.register(...)}} which is ultimately responsible for checking if there are any "old" tlogs on disk, and if so handling the "Replaying tlog for <URL> during startup" logic
>  * Because this happens in a background thread, other logic/requests can be handled by this core/replica in the meantime - before it starts (or while in the middle of) replaying the tlogs
>  ** Notably: *leaders that have not yet replayed tlogs on startup will erroneously respond to RTG / Fingerprint / PeerSync requests from other replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog ... during startup" logic ... particularly since this can happen even for replicas that are/become leaders. It seems like this could potentially cause a whole host of problems, only one of which manifests in this particular test failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX (either of its own volition, or because it was instructed to by leaderX) in an attempt to recover
>  *** the responses to these recovery requests will not include updates in the tlog files that existed on leaderX prior to startup but have not yet been replayed
>  * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) updates that the other replicas know nothing about (a toy model of this sequence is sketched below)
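To make the sequence above concrete, here is a toy, self-contained model of the race in plain Java. It is not Solr code, just the shape of the problem: the leader's tlog replay runs on a background executor (like ZkContainer.coreZkRegister) while the "recovering" peer copies whatever the leader has at that moment. Class, document and variable names are all made up.
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model of the race described above; NOT Solr code.
public class TlogReplayRace {
  static final List<String> leaderUpdates = new CopyOnWriteArrayList<>();

  public static void main(String[] args) throws Exception {
    ExecutorService coreZkRegister = Executors.newSingleThreadExecutor();

    // Leader starts up: registering in ZK kicks off tlog replay in the background.
    coreZkRegister.submit(() -> {
      sleep(200); // simulated delay before/while "Replaying tlog ... during startup"
      leaderUpdates.add("doc-from-old-tlog");
      System.out.println("leaderX: finished replaying tlog on startup");
    });

    // Meanwhile the leader is already answering recovery requests, so replicaY
    // "recovers" against incomplete data and then believes it is in sync.
    List<String> replicaY = new CopyOnWriteArrayList<>(leaderUpdates);
    System.out.println("replicaY recovered with: " + replicaY);

    coreZkRegister.shutdown();
    coreZkRegister.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("leaderX now has:         " + leaderUpdates); // replicaY is out of sync
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}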



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org