Posted to dev@lucene.apache.org by "Scott Blum (JIRA)" <ji...@apache.org> on 2015/08/05 00:30:04 UTC

[jira] [Commented] (SOLR-7869) Overseer does not handle BadVersionException correctly

    [ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654472#comment-14654472 ] 

Scott Blum commented on SOLR-7869:
----------------------------------

But what's the right fix?  Having looked through the code a bit now, I can see that Overseer.ClusterStateUpdater has a *very* baked-in assumption that no one else is updating cluster state.  Copies of ClusterState float around and get updated over and over during processing, on the assumption that the local node is performing an atomic sequence of operations to reach a desired end state.  How can external changes be merged in?  My impulse was to catch BadVersionException, refresh ClusterState from ZK, and then re-apply all the queued updates against the refreshed state.  However, I'm afraid that approach violates all of ClusterStateUpdater's assumptions.  I think the only thing to do is to clobber whatever is in ZK with what Overseer wants to write, even though that seems less than ideal.
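
To make the "refresh and re-apply" idea concrete, here is a minimal sketch of it against an in-memory stand-in for a versioned ZK node. Nothing here is Solr or ZooKeeper code: VersionedStore, Snapshot, StaleVersionException and RetryingUpdater are hypothetical names invented for illustration, and the queued updates are assumed to be pure functions of the state, which is exactly the property the current Overseer code may not have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Immutable view of the stored data together with the version it was read at. */
final class Snapshot {
    final String data;
    final int version;
    Snapshot(String data, int version) { this.data = data; this.version = version; }
}

/** Hypothetical analogue of BadVersionException (unchecked only to keep the sketch short). */
final class StaleVersionException extends RuntimeException {
    StaleVersionException(int expected, int actual) {
        super("expected version " + expected + " but store is at " + actual);
    }
}

/** Hypothetical stand-in for a versioned ZK node such as /clusterstate.json. */
final class VersionedStore {
    private String data;
    private int version = 0;

    VersionedStore(String initial) { this.data = initial; }

    synchronized Snapshot read() { return new Snapshot(data, version); }

    /** Conditional write: succeeds only if the caller saw the latest version. */
    synchronized int write(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            throw new StaleVersionException(expectedVersion, version);
        }
        data = newData;
        return ++version;
    }

    /** Unconditional write, used here to simulate an external modification. */
    synchronized void forceWrite(String newData) { data = newData; version++; }
}

final class RetryingUpdater {
    /** Re-applies every pending update on top of freshly read state after a version conflict. */
    static Snapshot applyAll(VersionedStore store, Snapshot cached,
                             List<UnaryOperator<String>> pending, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Snapshot base = cached;
        StaleVersionException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String state = base.data;
            for (UnaryOperator<String> update : pending) {
                state = update.apply(state); // assumes updates are pure functions of state
            }
            try {
                int newVersion = store.write(state, base.version);
                return new Snapshot(state, newVersion);
            } catch (StaleVersionException e) {
                last = e;
                base = store.read(); // refresh instead of retrying with the same stale version
            }
        }
        throw last; // bounded retries, so a persistent conflict cannot loop forever
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore("a");
        Snapshot cached = store.read();   // the updater's cached copy
        store.forceWrite("a,ext");        // someone else edits the node
        List<UnaryOperator<String>> pending = new ArrayList<>();
        pending.add(s -> s + ",x");
        pending.add(s -> s + ",y");
        Snapshot result = applyAll(store, cached, pending, 3);
        System.out.println(result.data + " @ version " + result.version); // a,ext,x,y @ version 2
    }
}
```

The retry limit is the part that matters for this issue: whatever the recovery strategy ends up being, a conflict should lead to either a refreshed version or a hard failure, not an unbounded loop. The "clobber" alternative corresponds to forceWrite above; in ZooKeeper, passing -1 as the version to setData skips the version check.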

> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
>                 Key: SOLR-7869
>                 URL: https://issues.apache.org/jira/browse/SOLR-7869
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-low
>             Fix For: 5.3, Trunk
>
>
> If /clusterstate.json is modified externally, the Overseer can go into an infinite loop on a BadVersionException, alternately trying to execute the main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"} current state version: 9
> INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null message={
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"}
> INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already registered
> ERROR - 2015-08-04 18:49:56.225; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.225; [   ] org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted to ver 8
> {code}


