Posted to dev@lucene.apache.org by "Ishan Chattopadhyaya (JIRA)" <ji...@apache.org> on 2016/11/17 17:08:59 UTC
[jira] [Commented] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy
[ https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674255#comment-15674255 ]
Ishan Chattopadhyaya commented on SOLR-6056:
--------------------------------------------
It seems #1 was committed here and #2 was dealt with at SOLR-8371. [~shalinmangar], [~markrmiller@gmail.com], can we link SOLR-8371 as a related issue, and close this?
> Zookeeper crash JVM stack OOM because of recover strategy
> ----------------------------------------------------------
>
> Key: SOLR-6056
> URL: https://issues.apache.org/jira/browse/SOLR-6056
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.6
> Environment: Two linux servers, 65G memory, 16 core cpu
> 20 collections, every collection has one shard two replica
> one zookeeper
> Reporter: Raintung Li
> Assignee: Shalin Shekhar Mangar
> Priority: Critical
> Labels: cluster, crash, recover
> Attachments: patch-6056.txt
>
>
> Some errors, such as "org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later", when they occur in DistributedUpdateProcessor, trigger the core admin recovery process.
> That means every update request sends a core admin recovery request.
> (see the code DistributedUpdateProcessor.java doFinish())
> The terrible thing is that CoreAdminHandler starts a new thread to publish the recovery status and start recovery. The thread count grows very quickly and causes a stack OOM. The Overseer cannot handle that many status updates, and the ZooKeeper nodes under /overseer/queue (for example, qn-0000125553) increased by more than 40 thousand in two minutes.
> In the end, ZooKeeper crashed.
> Worse still, with so many nodes queued in ZooKeeper, the cluster could not publish the correct status because only one Overseer is working. I had to start three threads to clear the queue nodes. The cluster did not work normally for nearly 30 minutes...
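
The failure mode described above (one new recovery thread per failed update) can be illustrated with a minimal sketch of request coalescing. This is not the patch attached to this issue nor the change made in SOLR-8371; the class RecoveryGuard and its methods are hypothetical names chosen for illustration, and the only assumption is that recovery for a given core can be represented as a Runnable. The idea is that a second request for a core that is already recovering is dropped instead of spawning another thread.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical guard (not a Solr class): at most one in-flight recovery per core. */
final class RecoveryGuard {
    // Names of cores that currently have a recovery thread running.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    // Number of recovery threads actually started, kept for illustration.
    private final AtomicInteger started = new AtomicInteger();

    /**
     * Starts a recovery thread for coreName unless one is already running.
     * Returns true if a thread was started, false if the request was dropped.
     */
    boolean requestRecovery(String coreName, Runnable recovery) {
        // Set.add is atomic here: only one caller per core name sees true.
        if (!inFlight.add(coreName)) {
            return false; // duplicate request, no new thread
        }
        started.incrementAndGet();
        Thread t = new Thread(() -> {
            try {
                recovery.run();
            } finally {
                // Allow a later recovery once this one has finished.
                inFlight.remove(coreName);
            }
        }, "recovery-" + coreName);
        t.setDaemon(true);
        t.start();
        return true;
    }

    int startedCount() {
        return started.get();
    }
}
```

With a guard like this, a burst of failed updates against the same core would start one recovery thread rather than one per update, which also bounds the number of status messages that recovery publishes. Whether dropping or queueing the duplicate requests is the right policy depends on the recovery semantics, which this sketch does not model.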