Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2018/04/03 21:04:27 UTC

RE: 7.2.1 cluster dies within minutes after restart

To clear things up, this has been resolved; the problem was in our custom analyzers, where we loaded dictionaries in the wrong method. If I remember correctly, we loaded them in createComponents (or the other one; I don't have the code in front of me), so dictionaries were being loaded per thread.

Apparently, in some cases, Solr creates a lot of threads, causing our code to load the same dictionaries over and over.
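
For what it's worth, here is a minimal sketch of the fixed pattern. This is not our actual code: the analyzer name is made up and KeepWordFilter merely stands in for whatever consumes the dictionary. The point is that the dictionary is loaded once per JVM rather than inside createComponents, which Lucene calls for every thread that needs a new TokenStream:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.KeepWordFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class DictionaryAnalyzer extends Analyzer {

    // Loaded once per JVM. Doing this load inside createComponents()
    // instead meant it ran again for every thread that needed a
    // TokenStream, reading the same dictionaries over and over.
    private static final CharArraySet DICTIONARY = loadDictionary();

    private static CharArraySet loadDictionary() {
        // Placeholder: the real code reads the dictionary files from disk.
        return new CharArraySet(Arrays.asList("example", "words"), true);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Called per thread/field by Lucene's reuse strategy; must stay cheap.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new KeepWordFilter(source, DICTIONARY);
        return new TokenStreamComponents(source, result);
    }
}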

Although we were at fault, I still don't understand why Solr sometimes created so many threads at unpredictable times, be it a few times just after start-up, or hours or days after start-up, while our query load and document ingestion rate stayed the same.

Thanks,
Markus

-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Friday 26th January 2018 18:03
> To: Solr-user <so...@lucene.apache.org>
> Subject: 7.2.1 cluster dies within minutes after restart
> 
> Hello,
> 
> We recently upgraded our clusters from 7.1 to 7.2.1. One collection (2 shards, 2 replicas) in particular is in a bad state almost continuously. After a proper restart the cluster is all green, but within minutes the logs are flooded with many bad omens:
> 
> o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms (although zkClientTimeOut is 30000).
> o.a.s.c.Overseer could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
> o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has occurred
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url
> etc etc etc
> 2018-01-26 16:43:31.419 WARN  (OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853) [   ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we are closed, exiting.
> 
> Soon most nodes are gone; maybe one is still green or yellow (recovering from another dead node).
> 
> A point of interest is that this collection is always under maximum load, receiving hundreds of queries per node per second. We disabled querying of the cluster and restarted it again; this time it kept running fine, and it continued to run fine even when we slowly resumed the tons of queries that need to be fired.
> 
> We just reverted the modifications above, so the cluster now receives the full query load as soon as it is available; everything was restarted and everything is suddenly fine again.
> 
> We really have no clue why everything is fine for days, then we suddenly enter some weird state (the logs loaded with o.a.z.ClientCnxn Client session timed out messages) and it takes several full restarts for things to settle down. Then all was fine until this afternoon, when for two hours the cluster kept dying almost instantly. And at this moment all is well again, it seems. The only steady companion when things go bad are the ZK-related time-outs.
> 
> Under normal circumstances we do not time out due to GC; the heap is just 2 GB. Query response times are ~10 ms even under maximum load. We would like to know why and how it enters a 'bad state' for no apparent reason. Any ideas?
> 
> Many thanks!
> Markus
> 
> side note: This cluster has always been a pain, but 7.2.1 made something worse; reverting to 7.1 is not possible because the index is too new (there were no notes in CHANGES indicating an index incompatibility between these two minor versions).
> 
> 
>