Posted to dev@lucene.apache.org by "Raintung Li (JIRA)" <ji...@apache.org> on 2014/05/12 07:35:14 UTC

[jira] [Commented] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy

    [ https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994821#comment-13994821 ] 

Raintung Li commented on SOLR-6056:
-----------------------------------

1. Move the status reporting from CoreAdminHandler into the doRecovery method, so that only one thread reports this status.
2. If a recovery thread is already running for the core, any other recovery thread quits, unless a parameter is set to force recovery.
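The two points above amount to a per-core guard that lets exactly one recovery thread run at a time. A minimal sketch of that idea (the class and method names here are hypothetical, not the actual patch code in patch-6056.txt):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed guard: at most one recovery thread
// per core runs at a time; later requests become no-ops unless the
// caller sets a force-recovery flag.
public class RecoveryGuard {
    // Cores that currently have a recovery thread running.
    private final Map<String, Boolean> recovering = new ConcurrentHashMap<>();

    /**
     * Returns true if the caller acquired the right to run recovery for
     * this core; false if another thread is already recovering it and
     * forceRecovery is not set.
     */
    public boolean tryStartRecovery(String coreName, boolean forceRecovery) {
        if (forceRecovery) {
            recovering.put(coreName, Boolean.TRUE);
            return true;
        }
        // putIfAbsent is atomic: only the first thread sees null here.
        return recovering.putIfAbsent(coreName, Boolean.TRUE) == null;
    }

    /** Called by the winning recovery thread when it finishes. */
    public void finishRecovery(String coreName) {
        recovering.remove(coreName);
    }
}
```

With a guard like this, the flood of duplicate recovery requests described below would collapse to one active recovery (and one status report to the Overseer) per core, instead of one thread per failed update request.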

> Zookeeper crash JVM stack OOM because of recover strategy 
> ----------------------------------------------------------
>
>                 Key: SOLR-6056
>                 URL: https://issues.apache.org/jira/browse/SOLR-6056
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.6
>         Environment: Two linux servers, 65G memory, 16 core cpu
> 20 collections, every collection has one shard two replica 
> one zookeeper
>            Reporter: Raintung Li
>            Priority: Critical
>              Labels: cluster, crash, recover
>         Attachments: patch-6056.txt
>
>
> Errors like "org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later" cause DistributedUpdateProcessor to trigger the core admin recovery process.
> That means every update request sends a core admin recover request.
> (see DistributedUpdateProcessor.java, doFinish())
> The terrible thing is that CoreAdminHandler starts a new thread for each request to publish the recovery status and start recovery. Threads pile up very quickly and the JVM runs out of stack memory. The Overseer can't handle that many status updates; the zookeeper nodes under /overseer/queue (e.g. /overseer/queue/qn-0000125553) grew by more than 40 thousand in two minutes.
> In the end zookeeper crashed.
> Worse, with that many nodes in the zookeeper queue, the cluster couldn't publish the correct status, because only one Overseer does the work. I had to start three threads to clear the queue nodes, and the cluster didn't work normally for nearly 30 minutes...



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org