Posted to solr-user@lucene.apache.org by Michael Roberts <mr...@tableau.com> on 2015/01/22 19:42:31 UTC

SolrCloud timing out marking node as down during startup.

Hi,

I'm seeing some odd behavior that I am hoping someone could explain to me.

The configuration I'm using to repro the issue has a ZK cluster and a single Solr instance. The instance has 10 cores, and none of the cores are sharded.

The initial startup is fine: the Solr instance comes up and we build our index. However, if the Solr instance exits uncleanly (killed rather than sent a SIGINT), the next time it starts I see the following in the logs.

2015-01-22 09:56:23.236 -0800 (,,,) localhost-startStop-1 : INFO  org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper...
2015-01-22 09:56:30.008 -0800 (,,,) localhost-startStop-1-EventThread : DEBUG org.apache.solr.common.cloud.SolrZkClient - Submitting job to respond to event WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes
2015-01-22 09:56:30.008 -0800 (,,,) zkCallback-2-thread-1 : DEBUG org.apache.solr.common.cloud.ZkStateReader - Updating live nodes... (0)
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : WARN  org.apache.solr.cloud.ZkController - Timed out waiting to see all nodes published as DOWN in our cluster state.
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : INFO  org.apache.solr.cloud.ZkController - Register node as live in ZooKeeper:/live_nodes/10.18.8.113:11000_solr

My question is about "Timed out waiting to see all nodes published as DOWN in our cluster state."

From a cursory look at the code, we seem to iterate through all collections/shards and mark each core's state as Down. These notifications are offered to the Overseer, which I believe updates the cluster state in ZooKeeper. We then wait for the ZK state to reflect the change, with a 60-second timeout.

However, it looks like the Overseer is not started until after we wait for the timeout. So, in a single-instance scenario, we'll always have to wait out the full timeout.
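To make the ordering concrete, here is a minimal, self-contained Java sketch of the sequence as I read it. The class, method, and queue names (DownStatePublishSketch, startOverseerAnalogue, overseerQueue) are invented for illustration and do not correspond to the actual ZkController/Overseer code, and the timeout is shortened from 60 seconds to 5 so it finishes quickly.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: names and structure are invented and do not
// match the real ZkController/Overseer classes.
public class DownStatePublishSketch {

    // Stand-in for the Overseer's work queue (in Solr this lives in ZooKeeper).
    static final LinkedBlockingQueue<String> overseerQueue = new LinkedBlockingQueue<>();

    // Stand-in for the cluster state that the Overseer would normally maintain.
    static final Map<String, String> clusterState = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        List<String> cores = Arrays.asList("core1", "core2", "core3");

        // Step 1: offer a "down" message for every core to the Overseer queue.
        for (String core : cores) {
            clusterState.put(core, "active");
            overseerQueue.offer(core + ":down");
        }

        // Step 2: wait, with a timeout, for the cluster state to show every core as down.
        // Nothing is consuming overseerQueue yet (the consumer only starts in step 3),
        // so on a single node this loop always runs until the timeout expires.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5); // real timeout is 60s
        while (System.nanoTime() < deadline && !allDown(cores)) {
            Thread.sleep(100);
        }
        if (!allDown(cores)) {
            System.out.println("Timed out waiting to see all cores published as down");
        }

        // Step 3: only now does the Overseer analogue start draining the queue.
        startOverseerAnalogue();
        Thread.sleep(500);
        System.out.println("After the consumer started: " + clusterState);
    }

    static boolean allDown(List<String> cores) {
        for (String core : cores) {
            if (!"down".equals(clusterState.get(core))) {
                return false;
            }
        }
        return true;
    }

    static void startOverseerAnalogue() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String msg = overseerQueue.take();      // e.g. "core1:down"
                    String[] parts = msg.split(":");
                    clusterState.put(parts[0], parts[1]);   // apply the state change
                }
            } catch (InterruptedException ignored) {
                // exit quietly
            }
        });
        t.setDaemon(true);
        t.start();
    }
}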

Is this the expected behavior (and just a side effect of running a single instance in cloud mode), or is my understanding of the Overseer/ZK relationship incorrect?

Thanks.

.Mike


Re: SolrCloud timing out marking node as down during startup.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Mike,

This is a bug that slows down cluster restarts; it was fixed in Solr 4.10.3 via
http://issues.apache.org/jira/browse/SOLR-6610. Since you have a single-node
cluster, you will run into it on every restart.

-- 
Regards,
Shalin Shekhar Mangar.