Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2017/12/08 07:09:00 UTC
[jira] [Updated] (SOLR-11685) CollectionsAPIDistributedZkTest.testCollectionsAPI fails regularly with leader mismatch

     [ https://issues.apache.org/jira/browse/SOLR-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Thacker updated SOLR-11685:
---------------------------------
    Attachment: jenkins_master_7045.log

Analysis from jenkins_master_7045.log

L20542: Test CollectionsAPIDistributedZkTest.testCollectionsAPI starts
Question: Why is halfcollectionblocker being deleted after this test has started and not before?
L20563: create collection awhollynewcollection_0 with 4 shards and 1 replica
L20746: ChaosMonkey monkey: stop jetty! 49379
L20774: This shows that the jetty that was shut down has core=awhollynewcollection_0_shard3_replica_n4
L20781: ChaosMonkey monkey: starting jetty! 49379
L20859: Exception causing close of session 0x1603371da360011 due to java.io.IOException: ZooKeeperServer not running/Watch limit violations
L20889: Restarting zookeeper
L20915: An add request comes in and is rejected with "ClusterState says we are the leader, but locally we don't think so" for awhollynewcollection_0_shard3_replica_n4


Presumably, when CloudSolrClient sent the request, awhollynewcollection_0_shard3_replica_n4 was the leader of shard3. After the restart it has not become leader again yet, but there are no other replicas.

CloudSolrClient should catch this exception, since its local cache might not be up to date, refresh its state, and retry the add operation. Today CloudSolrClient retries in {{requestWithRetryOnStaleState}} when {{wasCommError}} is true. DistributedUpdateProcessor#doDefensiveChecks throws this as a SolrException. We should throw it as a different exception type on which the operation can be retried.
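
A rough sketch of the retry behaviour proposed above, not the actual CloudSolrClient code. Every name in it (StaleStateRetrySketch, LeaderMismatchException, StateCache, requestWithRetry, maxRetries) is hypothetical and only illustrates the idea of a dedicated, retriable exception type: on a leader mismatch the client refreshes its cached state and sends the request again, up to a fixed number of retries.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: none of these types exist in Solr.
final class StaleStateRetrySketch {

    /** Assumed dedicated exception type signalling a leader mismatch (retriable). */
    public static class LeaderMismatchException extends RuntimeException {
        public LeaderMismatchException(String message) {
            super(message);
        }
    }

    /** Assumed stand-in for the client's locally cached cluster state. */
    public interface StateCache {
        void refresh();
    }

    /**
     * Runs the request. On a leader mismatch, refreshes the cached state and
     * retries, giving up after maxRetries retries. Other exceptions propagate
     * immediately without a retry.
     */
    public static <T> T requestWithRetry(Callable<T> request, StateCache cache, int maxRetries)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return request.call();
            } catch (LeaderMismatchException e) {
                if (retries >= maxRetries) {
                    throw e; // retries exhausted: surface the original failure
                }
                retries++;
                cache.refresh(); // the cached view may be stale, so reload it first
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated add request: rejected twice, then accepted.
        String result = requestWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new LeaderMismatchException("locally we don't think we are the leader");
            }
            return "added";
        }, () -> System.out.println("refreshing cached cluster state"), 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real change would probably also wait between attempts, since the restarted replica needs time to win the leader election; the sketch leaves that out to keep the control flow visible.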


> CollectionsAPIDistributedZkTest.testCollectionsAPI fails regularly with leader mismatch
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-11685
>                 URL: https://issues.apache.org/jira/browse/SOLR-11685
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>         Attachments: jenkins_7x_257.log, jenkins_master_7045.log
>
>
> I've been noticing lots of failures on Jenkins where the document add gets rejected because of a leader conflict and throws an error like 
> {code}
> ClusterState says we are the leader (https://127.0.0.1:38715/solr/awhollynewcollection_0_shard2_replica_n2), but locally we don't think so. Request came from null
> {code}
> Scanning Jenkins logs, I see that these failures have increased since Sept 28th and have been occurring daily.


