Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2018/02/28 08:20:00 UTC

[jira] [Comment Edited] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

    [ https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379927#comment-16379927 ] 

Shalin Shekhar Mangar edited comment on SOLR-12011 at 2/28/18 8:19 AM:
-----------------------------------------------------------------------

Thanks Dat. A few comments:
# The line {{log.info("skip url:{} cause its term is less than leader", replica.getCoreUrl());}} will be logged on every update request while the other replicas don't have the same term as the leader. Perhaps this should be at debug level.
# ElectionContext has {{if (weAreReplacement && isClosed)}}. Did you mean {{!isClosed}}?
# ElectionContext has {{getReplicas(EnumSet.of(Replica.Type.TLOG, Replica.Type.TLOG))}}. Perhaps you meant TLOG and NRT?
# ElectionContext has replaced shouldIBeLeader(), which had a check that the last published state is active. I'm curious whether there can be a condition where the term is not registered and the last published state is not active, and the replica therefore becomes the leader anyway.
# PrepRecoveryOp refreshes the terms if {{shardTerms.skipSendingUpdatesTo(coreNodeName)}} returns true. But should it not wait in a loop for the skip status to go away? The reason behind PrepRecovery is to ensure that when the call to prep recovery returns, the leader has already seen the {{waitForState}} state and is therefore already forwarding updates to the recovering replica. Now that the behavior has changed to forward updates only after the terms are equal, rather than depending on seeing the 'recovering' state, we should change PrepRecovery as well (see the sketch after this list).
# Add a comment before the calls to {{getShardTerms(collection, shardId).startRecovering(coreNodeName);}} and {{getShardTerms(collection, shardId).doneRecovering(coreNodeName);}} in ZkController.publish() describing why they are necessary and why only PULL replicas are excluded. I understand the reason, but it can be confusing to others reading this code.
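
To make point 5 concrete, here is a rough sketch of the kind of wait loop PrepRecoveryOp could use: keep refreshing the terms and return only once the leader no longer skips the recovering replica. This is not the actual patch; the {{refreshTerms()}} call, the timeout value, and the error handling are illustrative assumptions.
{code:java}
// Hypothetical sketch only: wait until the leader stops skipping updates to this replica,
// instead of refreshing the terms once and returning immediately.
TimeOut timeOut = new TimeOut(60, TimeUnit.SECONDS, TimeSource.NANO_TIME);
while (!timeOut.hasTimedOut()) {
  shardTerms.refreshTerms(); // assumed refresh call, name for illustration only
  if (!shardTerms.skipSendingUpdatesTo(coreNodeName)) {
    // the leader is now forwarding updates to this replica; safe to return from prep recovery
    return;
  }
  Thread.sleep(500); // back off before re-checking
}
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
    "Timed out waiting for the leader to start sending updates to " + coreNodeName);
{code}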


> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch
>
>
> Currently, we can hit a consistency problem when in-sync replicas are DOWN. For example:
> 1. A collection with 1 shard, which has 1 leader and 2 replicas
> 2. The nodes containing the 2 replicas go down
> 3. The leader receives an update A successfully
> 4. The node containing the leader goes down
> 5. The 2 replicas come back
> 6. One of them becomes leader --> but it shouldn't become leader, since it missed update A
> A solution to this issue:
> - The idea here is that the term value of each replica is enough to tell whether that replica has received the latest updates or not
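
As a minimal sketch of that idea (the method and names below are illustrative, not taken from the patch): a replica would be eligible to become leader only if its registered term matches the highest term in the shard, because a lower term means it missed updates and must recover first.
{code:java}
// Hypothetical illustration of term-based leader eligibility, not the real implementation.
// 'terms' maps coreNodeName -> term, as read from ZooKeeper for this shard.
static boolean canBecomeLeader(java.util.Map<String, Long> terms, String myCoreNodeName) {
  long highestTerm = terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
  long myTerm = terms.getOrDefault(myCoreNodeName, 0L);
  return myTerm == highestTerm; // a replica that missed update A would have a lower term
}
{code}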



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org