Posted to dev@lucene.apache.org by "Michael (JIRA)" <ji...@apache.org> on 2018/10/10 15:02:00 UTC

[jira] [Commented] (SOLR-8868) SolrCloud: if zookeeper loses and then regains a quorum, Solr nodes and SolrJ Client do not recover and need to be restarted

    [ https://issues.apache.org/jira/browse/SOLR-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645105#comment-16645105 ] 

Michael commented on SOLR-8868:
-------------------------------

Same issue on SolrCloud 7.3.1 on Kubernetes.

> SolrCloud: if zookeeper loses and then regains a quorum, Solr nodes and SolrJ Client do not recover and need to be restarted
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8868
>                 URL: https://issues.apache.org/jira/browse/SOLR-8868
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, SolrJ
>    Affects Versions: 5.3.1
>            Reporter: Frank J Kelly
>            Priority: Major
>
> Tried mailing list on 3/15 and 3/16 to no avail. Hopefully I gave enough details.
> ----
> Just wondering whether the SolrCloud behavior I observe after ZooKeeper loses a quorum is normal and to be expected.
> Version of Solr: 5.3.1
> Version of ZooKeeper: 3.4.7
> Using SolrCloud with external ZooKeeper
> Deployed on AWS
> Our Solr cluster has 3 nodes (m3.large)
> Our ZooKeeper ensemble consists of three nodes (t2.small) with the same config using DNS names, e.g.:
> {noformat}
> $ more ../conf/zoo.cfg
> tickTime=2000
> dataDir=/var/zookeeper
> dataLogDir=/var/log/zookeeper
> clientPort=2181
> initLimit=10
> syncLimit=5
> standaloneEnabled=false
> server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
> server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
> server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
> {noformat}
> If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a quorum is maintained.
> Operation continues OK; we detect the terminated instance and relaunch a new ZK node, which comes up fine.
> If we terminate two of the ZK nodes we lose a quorum and then we observe the following:
> 1.1) Admin UI shows an error that it is unable to contact ZooKeeper: “Could not connect to ZooKeeper”
> 1.2) SolrJ returns the following
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
> at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
> at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
> ... 24 more
> {noformat}
> This makes sense based on our understanding.
> When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix the DNS, etc., we regain a quorum, but at this point:
> 2.1) Admin UI shows the shards as “GONE” (all greyed out)
> 2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now bound to new IP addresses
> So I restart the Solr nodes. After the restart:
> 3.1) Admin UI shows the collections as OK (all shards are green) – yeah the nodes are back!
> 3.2) SolrJ Client still shows the same error – namely
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
> at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
> at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
> .
> .
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
> {noformat}
> Is this lack of self-healing known and expected behavior?
> If it is expected, should this be recast as an Improvement request?
> Is this the same as, or similar to, the behavior documented here: https://issues.apache.org/jira/browse/SOLR-5129 ?
> P.S. I can add Solr log files if they will help.
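
For reference, the quorum behavior described in the report (one ZK node lost: still working; two lost: quorum gone) is what ZooKeeper's default majority quorum predicts for a three-server ensemble. Below is a minimal sketch of that arithmetic. It assumes the default majority rule, with no weighted or hierarchical quorum settings (none appear in the quoted zoo.cfg). The class and method names are illustrative only and are not part of ZooKeeper or Solr.

```java
public class QuorumCheck {
    // Smallest number of servers that forms a strict majority of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // True if the servers still running can form a majority quorum.
    static boolean hasQuorum(int ensembleSize, int failedServers) {
        return ensembleSize - failedServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // server.1 .. server.3 in the quoted zoo.cfg
        for (int failed = 0; failed <= ensembleSize; failed++) {
            System.out.printf("failed=%d quorum=%b%n",
                    failed, hasQuorum(ensembleSize, failed));
        }
    }
}
```

For three servers the majority is two, so one failure leaves a quorum and two failures do not, which matches the two scenarios in the report.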



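Since the report says the SolrJ client keeps failing until the application is restarted, one possible in-process stopgap is to discard and rebuild the client object after repeated failures. The sketch below is hypothetical and uses only the JDK: RecreatingClient is not a SolrJ class, the factory would have to construct whatever client the application uses, and any checked exceptions would need to be wrapped by the caller. Whether rebuilding the client actually clears the stale ZooKeeper connection state described above is an assumption that this thread does not confirm.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper (not part of SolrJ): rebuilds a client after repeated failures. */
public class RecreatingClient<C extends AutoCloseable> {
    private final Supplier<C> factory;
    private final int maxConsecutiveFailures;
    private C client;
    private int consecutiveFailures;

    public RecreatingClient(Supplier<C> factory, int maxConsecutiveFailures) {
        this.factory = factory;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.client = factory.get();
    }

    /** Runs the operation; any RuntimeException counts as a failure and is rethrown. */
    public synchronized <R> R call(Function<C, R> operation) {
        try {
            R result = operation.apply(client);
            consecutiveFailures = 0; // a success resets the failure streak
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxConsecutiveFailures) {
                rebuild(); // the next call will use a fresh client
            }
            throw e;
        }
    }

    // Closes the old client (ignoring close errors) and asks the factory for a new one.
    private void rebuild() {
        try {
            client.close();
        } catch (Exception ignored) {
            // The old client is being discarded anyway.
        }
        client = factory.get();
        consecutiveFailures = 0;
    }
}
```

The failed call is still reported to the caller; the wrapper only ensures that later calls stop reusing a client that has failed several times in a row. Treating every RuntimeException as a connection failure is a simplification; a real implementation would probably inspect the exception cause.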