Posted to dev@lucene.apache.org by "Ugo Matrangolo (JIRA)" <ji...@apache.org> on 2014/09/24 13:23:35 UTC

[jira] [Comment Edited] (SOLR-5961) Solr gets crazy on /overseer/queue state change

    [ https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146217#comment-14146217 ] 

Ugo Matrangolo edited comment on SOLR-5961 at 9/24/14 11:23 AM:
----------------------------------------------------------------

Happened again :/

After routine maintenance of our network caused a roughly 30-second connectivity hiccup, the Solr cluster started spamming /overseer/queue with more than 47k events.

{code}
[zk: zookeeper4:2181(CONNECTED) 26] get /gilt/config/solr/overseer/queue
null
cZxid = 0x290008df29
ctime = Fri Aug 29 02:06:47 GMT+00:00 2014
mZxid = 0x290008df29
mtime = Fri Aug 29 02:06:47 GMT+00:00 2014
pZxid = 0x290023cedd
cversion = 60632
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 47822
[zk: zookeeper4:2181(CONNECTED) 27]
{code}
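As a side note, the numChildren field in zkCli stat output like the above can be extracted with a small script to alert on a runaway queue. This is only an illustrative sketch (the threshold of 10000 is an arbitrary choice, not anything Solr-defined):

```python
import re

def parse_stat(output: str) -> dict:
    """Parse zkCli stat-style lines like 'numChildren = 47822'
    into a dict of field name -> string value."""
    fields = {}
    for line in output.splitlines():
        m = re.match(r"^(\w+) = (.+)$", line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

# Sample taken from the stat output above.
SAMPLE = """\
cZxid = 0x290008df29
ctime = Fri Aug 29 02:06:47 GMT+00:00 2014
numChildren = 47822
"""

stats = parse_stat(SAMPLE)
queue_depth = int(stats["numChildren"])
# Arbitrary alert threshold; tune for your cluster.
if queue_depth > 10000:
    print(f"WARNING: /overseer/queue has {queue_depth} children")
```

In practice you would feed this the output of `zkCli.sh stat /overseer/queue` from a cron job or monitoring agent.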

This time we waited for it to heal on its own and watched the numChildren count go down, but then it climbed back up: it was never going to recover by itself.

As usual, we had to shut down the whole cluster, rmr /overseer/queue, and restart.

Annoying :/



> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate machines)
>            Reporter: Maxim Novikov
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  ? LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  ? Update state numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It keeps spamming these messages with no delay, and restarting all the nodes does not help. I even tried stopping all the nodes in the cluster first, but when I start one, the behavior doesn't change: it goes crazy over the /overseer/queue state again.
> PS The only way to handle this was to stop everything, manually clean up all the Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can understand, this is unbearable in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
