Posted to dev@lucene.apache.org by "Commit Tag Bot (JIRA)" <ji...@apache.org> on 2013/03/22 17:41:16 UTC
[jira] [Commented] (SOLR-3812) ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.

    [ https://issues.apache.org/jira/browse/SOLR-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610846#comment-13610846 ] 

Commit Tag Bot commented on SOLR-3812:
--------------------------------------

[branch_4x commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revision&revision=1384937

SOLR-3833: When an election is started because a leader went down, the new leader candidate should decline if the last state it published was not active.

SOLR-3836: When doing peer sync, we should only count sync attempts that cannot reach the given host as success when the candidate leader is syncing with the replicas - not when replicas are syncing to the leader.

SOLR-3835: In our leader election algorithm, if on connection loss we found we did not create our election node, we should retry, not throw an exception.

SOLR-3834: A new leader on cluster startup should also run the leader sync process in case there was a bad cluster shutdown.

SOLR-3772: On cluster startup, we should wait until we see all registered replicas before running the leader process - or, if they do not all come up, until N amount of time has passed.
  
SOLR-3756: If we are elected the leader of a shard, but we fail to publish this for any reason, we should clean up and retrigger a leader election.

SOLR-3812: ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.
  
SOLR-3813: When a new leader syncs, we need to ask all shards to sync back, not just those that are active.

SOLR-3807: Currently during recovery we pause for a number of seconds after waiting for the leader to see a recovering state so that any previous updates will have finished before our commit on the leader - we don't need this wait for peersync.
  
SOLR-3837: When a leader is elected and asks replicas to sync back to it and that fails, we should ask those nodes to recover asynchronously rather than synchronously.
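
As an illustration of the SOLR-3772 item above (wait for all registered replicas, but give up after a bounded time), here is a minimal sketch. It is not the code from the commit: the class name ReplicaWait, the method waitForReplicas, and the polling approach are assumptions made for this example only, and the replica count is supplied by a caller-provided function rather than read from ZooKeeper.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Solr: waits for replicas with a timeout.
public class ReplicaWait {

    /**
     * Polls liveCount until it reports at least `expected` replicas or
     * until timeoutMs has elapsed. Returns true only if all expected
     * replicas were seen before the deadline.
     */
    public static boolean waitForReplicas(IntSupplier liveCount, int expected,
                                          long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (liveCount.getAsInt() >= expected) {
                return true; // everyone showed up, run the leader process now
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // timed out, caller proceeds with what it has
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.max(1, Math.min(pollMs, remainingMs)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

A caller would run the leader process in either case; the return value only tells it whether the full replica set was present when it started.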

                
> ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-3812
>                 URL: https://issues.apache.org/jira/browse/SOLR-3812
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 4.0, 5.0
>
>
> When we lose our connection to ZooKeeper due to a ConnectionLoss (one that does not lead to session expiration), we can drop updates while replaying buffered updates and still think we have successfully recovered.
> We need to detect this and fail recovery when it happens. We should also increase how long we wait for reconnection when an update comes in and we have lost our connection to ZooKeeper (up to the session timeout).
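
The detection described in the issue can be sketched roughly as follows: count every buffered update that fails to apply during replay, and report recovery as successful only if nothing was dropped. This is a simplified model, not Solr's recovery code; ReplayCheck, ReplayResult, and the Predicate-based apply callback are hypothetical names introduced for the example.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not part of Solr: replay buffered updates and
// refuse to report success if any of them was dropped.
public class ReplayCheck {

    /** Outcome of one replay pass. */
    public static final class ReplayResult {
        public final int applied;
        public final int dropped;

        ReplayResult(int applied, int dropped) {
            this.applied = applied;
            this.dropped = dropped;
        }

        /** Recovery counts as successful only if no update was dropped. */
        public boolean recovered() {
            return dropped == 0;
        }
    }

    /**
     * Applies each buffered update via `apply`. An update counts as dropped
     * if `apply` returns false or throws (for example, because the
     * connection to the coordination service was lost mid-replay).
     */
    public static <U> ReplayResult replay(List<U> buffered, Predicate<U> apply) {
        int applied = 0;
        int dropped = 0;
        for (U update : buffered) {
            boolean ok;
            try {
                ok = apply.test(update);
            } catch (RuntimeException e) {
                ok = false; // treat a failure as a dropped update, do not hide it
            }
            if (ok) {
                applied++;
            } else {
                dropped++;
            }
        }
        return new ReplayResult(applied, dropped);
    }
}
```

In this model the caller would check recovered() after the replay and restart recovery when it returns false, instead of publishing an active state.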
