Posted to solr-user@lucene.apache.org by Jeff Wartes <jw...@whitepages.com> on 2014/02/21 18:23:11 UTC

ZK connection problems

I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve been plagued with is that during indexing, occasionally a node decides it can’t talk to ZK, and this disables updates in the pool. The node usually recovers within a second or two. It’s possible this happens when I’m not indexing too, but I’m much less likely to notice.

I’ve seen this with multiple sharding configurations and multiple cluster sizes. I’ve searched around, and I think I’ve addressed the usual resolutions when someone complains about ZK and Solr. I’m using:

* 60-sec ZK connection timeout (although this seems like a pretty terrible requirement)
* Independent 3-node ZK cluster, also in AWS.
* Solr 4.6.1
* Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
* 5-min auto-hard-commit with openSearcher=false

I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on the nodes doesn’t exceed 20%; typically it’s around 5%.

Here is the relevant section of logs from one of the nodes when this happened:
http://pastebin.com/K0ZdKmL4

It looks like it had a connection timeout, and tried to re-establish the same session on a connection to a new ZK node, except the session had also expired. It then closes *that* connection, changes to read-only mode, and eventually creates a new connection and new session which allows writes again.

Can anyone familiar with the ZK connection/session stuff comment on whether this is a bug? I really know nothing about proper ZK client behaviour.

Thanks.

Re: ZK connection problems

Posted by Mark Miller <ma...@gmail.com>.


On Feb 21, 2014, at 12:23 PM, Jeff Wartes <jw...@whitepages.com> wrote:

> 
> I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve been plagued with is that during indexing, occasionally a node decides it can’t talk to ZK, and this disables updates in the pool. The node usually recovers within a second or two. It’s possible this happens when I’m not indexing too, but I’m much less likely to notice.
> 
> I’ve seen this with multiple sharding configurations and multiple cluster sizes. I’ve searched around, and I think I’ve addressed the usual resolutions when someone complains about ZK and Solr. I’m using:
> 
>  *   60-sec ZK connection timeout (although this seems like a pretty terrible requirement)

Be aware that the session timeout maxes out at 40 seconds with the default tickTime of 2000 ms: ZooKeeper caps it at 20 × tickTime unless maxSessionTimeout is set on the ZK servers.

>  *   Independent 3-node ZK cluster, also in AWS.
>  *   Solr 4.6.1
>  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
>  *   5-min auto-hard-commit with openSearcher=false
> 
> I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on the nodes doesn’t exceed 20%; typically it’s around 5%.
> 
> Here is the relevant section of logs from one of the nodes when this happened:
> http://pastebin.com/K0ZdKmL4
> 
> It looks like it had a connection timeout, and tried to re-establish the same session on a connection to a new ZK node, except the session had also expired. It then closes *that* connection, changes to read-only mode, and eventually creates a new connection and new session which allows writes again.
> 
> Can anyone familiar with the ZK connection/session stuff comment on whether this is a bug? I really know nothing about proper ZK client behaviour.
> 
> Thanks.
> 

You have to figure out why Solr is not able to talk to ZooKeeper for 40-60 seconds. Perhaps it’s the network, perhaps it’s the…I’m not sure. But for some reason a very simple heartbeat cannot occur for a long time, and for Solr to receive updates, it has to maintain a connection with ZooKeeper. You can either raise the timeout, or dig into why the connection heartbeat cannot be maintained (it’s very lightweight).
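
To make the timeout cap concrete, here is a minimal sketch of the arithmetic, assuming the ZooKeeper server defaults (minSessionTimeout = 2 × tickTime, maxSessionTimeout = 20 × tickTime). The class and method names are hypothetical and only illustrate the clamping; this is not ZooKeeper's or Solr's own code.

```java
// Hypothetical helper, for illustration only.
class SessionTimeoutSketch {

    // Assumption: the server uses its default bounds,
    // min = 2 * tickTime and max = 20 * tickTime (both in ms).
    static int negotiatedTimeoutMs(int requestedMs, int tickTimeMs) {
        int minMs = 2 * tickTimeMs;
        int maxMs = 20 * tickTimeMs;
        // The client's requested timeout is clamped into [minMs, maxMs].
        return Math.max(minMs, Math.min(maxMs, requestedMs));
    }

    public static void main(String[] args) {
        // A 60-second request against tickTime=2000 is cut down to 40 seconds.
        System.out.println(negotiatedTimeoutMs(60_000, 2_000)); // prints 40000
    }
}
```

So with default server settings, a 60-second client timeout would effectively behave as 40 seconds. Getting a real 60 seconds would likely also require raising tickTime or maxSessionTimeout in zoo.cfg on the ZK servers, not just the client-side setting in Solr.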

- Mark

http://about.me/markrmiller