Posted to solr-user@lucene.apache.org by Manohar Sripada <ma...@gmail.com> on 2016/07/13 02:32:36 UTC

Zookeeper overseer queue clogging

Our production cluster has 16 Solr nodes (Solr 5.2.1) and 5 Zookeeper nodes
(Zookeeper 3.4.6). We had to restart the Solr nodes, which we were doing for
the first time in 3 months. To our surprise, none of the Solr nodes came up.
We can see the Solr process running on each machine, but the Solr Admin
console is not reachable. We even tried restarting both the Zookeeper cluster
and the Solr nodes, but the issue remained.

While debugging, I found the following:
1. The exception below in solr.log:


> ERROR - 2016-07-12 07:43:48.988; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check solr/home property and the logs
> ERROR - 2016-07-12 07:43:49.012; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Could not find collection : cont_coll_2_fr
>         at org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:164)


2. I connected to the Zookeeper quorum using Zookeeper's zkCli.sh and found
that a few collections that were deleted through the Solr Collections Delete
API still exist in Zookeeper (ls /collections). The same collections do not
exist on disk on the Solr nodes.

3. There are entries related to these deleted collections in Zookeeper's
clusterstate.json file as well.

4. There are many entries in overseer queue (/overseer/queue) & queue-work
(/overseer/queue-work).
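
A minimal sketch of how the leftover entries from points 2 and 3 could be listed, assuming the output of `ls /collections` and the content of clusterstate.json have been copied out of zkCli.sh by hand. The function name, the sample collection names, and the list of collections expected to be live are all illustrative and not taken from our cluster. It also assumes clusterstate.json is a JSON object keyed by collection name.

```python
import json


def find_stale_collections(clusterstate_text, collections_children, expected):
    """Report collection names Zookeeper still tracks but that should be gone.

    clusterstate_text    -- raw text of /clusterstate.json (assumed to be a
                            JSON object whose top-level keys are collection names)
    collections_children -- names printed by `ls /collections`
    expected             -- names of collections that should still exist
    """
    state_names = set(json.loads(clusterstate_text))
    znode_names = set(collections_children)
    expected_names = set(expected)
    return {
        # Present in clusterstate.json but not expected to exist any more.
        "stale_in_clusterstate": sorted(state_names - expected_names),
        # Present under /collections but not expected to exist any more.
        "stale_in_collections": sorted(znode_names - expected_names),
    }


if __name__ == "__main__":
    # Hypothetical sample data, not real cluster contents.
    sample_state = '{"live_coll": {"shards": {}}, "deleted_coll_a": {"shards": {}}}'
    sample_children = ["live_coll", "deleted_coll_a", "deleted_coll_b"]
    print(find_stale_collections(sample_state, sample_children, ["live_coll"]))
```

Comparing both places against one list of expected collections makes it visible when a deleted collection was cleaned up in one location but not the other.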

I have tried the following, based on existing suggestions on the net:
1. Stopped all the Solr nodes and removed the unwanted collections (the ones
deleted through the Solr Collections Delete API) from Zookeeper
(/collections) using the rmr command.

2. Removed all the entries from overseer queue (/overseer/queue) &
queue-work (/overseer/queue-work) as well.

3. Restarted Zookeeper and then Solr.
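
For reference, the removals in steps 1 and 2 have roughly the shape below. This is only a sketch that builds the zkCli.sh `rmr` command lines for a given parent path and list of child names; the helper name and the sample child names are hypothetical, and the commands are printed for review rather than executed.

```python
def rmr_commands(parent, children):
    """Build one zkCli.sh `rmr` line per child znode under `parent`."""
    base = parent.rstrip("/")
    return ["rmr {}/{}".format(base, child) for child in children]


if __name__ == "__main__":
    # Hypothetical names; the real ones come from `ls` on each parent path.
    for line in rmr_commands("/collections", ["deleted_coll_a", "deleted_coll_b"]):
        print(line)
    for line in rmr_commands("/overseer/queue", ["example-entry-1"]):
        print(line)
```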

Even after doing this, the issue remains. Can someone help me resolve
this?

- Thanks