You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2017/10/11 10:13:00 UTC

[jira] [Commented] (SOLR-11445) Overseer.processQueueItem().... zkStateWriter.enqueueUpdate might ideally have a try{}catch{} around it

    [ https://issues.apache.org/jira/browse/SOLR-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200061#comment-16200061 ] 

Shalin Shekhar Mangar commented on SOLR-11445:
----------------------------------------------

I think it is better that we explicitly check for NoNode or NodeExists exceptions in the isBadMessageOrInvalidState() method. Most other KeeperExceptions shouldn't cause us to poll items off the queue. Also, the same kind of handling should be done for exceptions thrown when processing messages from state update queue.

> Overseer.processQueueItem()....  zkStateWriter.enqueueUpdate might ideally have a try{}catch{} around it
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11445
>                 URL: https://issues.apache.org/jira/browse/SOLR-11445
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.6.1, 7.0, master (8.0)
>            Reporter: Greg Harris
>         Attachments: SOLR-11445.patch
>
>
> So we had the following stack trace with a customer:
> 2017-10-04 11:25:30.339 ERROR (xxxx) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/xxxx/state.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:235)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:199)
>     at java.lang.Thread.run(Thread.java:748)
> I want to highlight: 
>   at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> This ends up coming from Overseer:
> while (data != null)  {
>                 final ZkNodeProps message = ZkNodeProps.load(data);
>                 log.debug("processMessage: workQueueSize: {}, message = {}", workQueue.getStats().getQueueLength(), message);
>                 // force flush to ZK after each message because there is no fallback if workQueue items
>                 // are removed from workQueue but fail to be written to ZK
>                 *clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
>                 workQueue.poll(); // poll-ing removes the element we got by peek-ing*
>                 data = workQueue.peek();
>               }
> Note: The processQueueItem comes before the poll, therefore upon a thrown exception the same node/message that won't process becomes stuck. This made a large cluster unable to come up on it's own without deleting the problem node. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org