Posted to solr-user@lucene.apache.org by Gerald Reinhart <ge...@kelkoo.com> on 2016/10/06 09:23:53 UTC

[Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

Hello everyone,

    Our SolrCloud cluster has been working very well for several months without any significant changes: the traffic to serve is stable, no major release deployed...

    But randomly, the SolrCloud leader puts all the replicas into recovery at the same time for no obvious reason.

    As a result, we can no longer serve queries, and the leader is overloaded replicating all the indexes to the replicas at once, which eventually means a downtime of approximately 30 minutes.

    Is there a way to prevent this? Ideally, a setting that limits the percentage of replicas that can be put into recovery at the same time?

Thanks,

Gérald, Elodie and Ludovic


--
Gérald Reinhart, Software Engineer
E gerald.reinhart@kelkoo.com    Y!Messenger gerald.reinhart
T +33 (0)4 56 09 07 41
A Parc Sud Galaxie - Le Calypso, 6 rue des Méridiens, 38130 Echirolles




Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

Posted by Petetin Ludovic <lu...@kelkoo.com>.
Hi Erick,

thanks for taking the time to answer.

What we are looking for here is not the root cause but a mechanism to prevent the leader from putting all the replicas into recovery at the same time, which inevitably leads to downtime.

Is there a setting in Solr similar to the Elasticsearch one below, which guarantees that enough nodes are always up to serve the traffic?

gateway.recover_after_nodes
Recover as long as this many data or master nodes have joined the cluster.
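
For reference, that setting lives in elasticsearch.yml, roughly like this (the values are purely illustrative for a 6-node cluster):

    gateway.recover_after_nodes: 4
    gateway.expected_nodes: 6
    gateway.recover_after_time: 5m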

To answer your questions as well, when the problem occurred, we had:
- no traffic peak
- no JVM memory issue
- no system memory issue
- no network issue
- no disk space issue
- no Zookeepers issue

The impacted cluster is composed of 6 Dell R420 servers with 96GB of RAM each, with an index of 10M documents, serving our traffic in France (400 queries per second).
We have several other Solr clusters in the same datacenter sharing the same ZooKeepers but serving other European and American countries; they were not impacted when the issue occurred.

Below are the logs of the leader at the moment of the issue and the logs of the replica (solr-57) that was put into recovery.

1. Log of the leader
2016-10-04 18:18:46,326 [searcherExecutor-87-thread-1-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.core.QuerySenderListener:newSearcher:49  - QuerySenderListener sending requests to Searcher@43a597cb[fr_blue] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_gxjd(5.4.1):C3804213/811353:delGen=2) Uninverting(_gwmj(5.4.1):C3449095/1848374:delGen=8) Uninverting(_gxkn(5.4.1):C2609089/459163:delGen=3) Uninverting(_gxs7(5.4.1):c669028/10913:delGen=1) Uninverting(_gxxf(5.4.1):c454860/104356:delGen=1) Uninverting(_gy55(5.4.1):c373416/59027) Uninverting(_gybq(5.4.1):c222631/1631) Uninverting(_gy6c(5.4.1):c29736/591:delGen=1) Uninverting(_gy76(5.4.1):C2020/45:delGen=2) Uninverting(_gy74(5.4.1):C1909/40:delGen=2) Uninverting(_gyhk(5.4.1):c192078/656) Uninverting(_gye4(5.4.1):c66133/977) Uninverting(_gyio(5.4.1):c43028/274) Uninverting(_gyjk(5.4.1):c48841/276) Uninverting(_gykl(5.4.1):c42989/101) Uninverting(_gylf(5.4.1):c31401/50) Uninverting(_gylt(5.4.1):c5802/6) Uninverting(_gyln(5.4.1):c4119/1) Uninverting(_gyn1(5.4.1):c36015/68) Uninverting(_gyo6(5.4.1):c27906/18) Uninverting(_gyox(5.4.1):c23494/4) Uninverting(_gyp3(5.4.1):c3686) Uninverting(_gyov(5.4.1):C1124) Uninverting(_gyot(5.4.1):C1224/1:delGen=1) Uninverting(_gyp0(5.4.1):C514/1:delGen=1) Uninverting(_gyoy(5.4.1):C573/1:delGen=1) Uninverting(_gyp1(5.4.1):C120/1:delGen=1) Uninverting(_gyou(5.4.1):C747/1:delGen=1) Uninverting(_gyor(5.4.1):C467) Uninverting(_gyp2(5.4.1):C101) Uninverting(_gyoz(5.4.1):C250)))}
2016-10-04 18:18:46,326 [searcherExecutor-87-thread-1-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.core.QuerySenderListener:newSearcher:96  - QuerySenderListener done.
2016-10-04 18:25:55,647 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99  - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7ma,generation=9874}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mb,generation=9875}
2016-10-04 18:25:55,650 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166  - newest commit generation = 9875
2016-10-04 18:25:56,010 [commitScheduler-89-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237  - Opening Searcher@39da46db[fr_blue] realtime
2016-10-04 18:40:56,195 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99  - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mb,generation=9875}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mc,generation=9876}
2016-10-04 18:40:56,196 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166  - newest commit generation = 9876
2016-10-04 18:40:56,387 [commitScheduler-89-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237  - Opening Searcher@7ce3335c[fr_blue] realtime
2016-10-04 18:43:39,426 [updateExecutor-2-thread-105261-processing-http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] ERROR org.apache.solr.update.StreamingSolrClients:handleError:78  - error
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,430 [http-8080-86] WARN  org.apache.solr.update.processor.DistributedUpdateProcessor:doFinish:792  - Error sending update to http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,430 [http-8080-86] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor:doFinish:886  - Setting up to try to start recovery on replica http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,431 [http-8080-86] INFO  org.apache.solr.cloud.ZkController:ensureReplicaInLeaderInitiatedRecovery:2054  - Put replica core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue on dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr into leader-initiated recovery.
2016-10-04 18:43:39,432 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.common.cloud.SolrZkClient:makePath:472  - makePath: /collections/fr_blue/leader_initiated_recovery/shard1
2016-10-04 18:43:39,441 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.ZkController:updateLeaderInitiatedRecoveryState:2194  - Wrote down to /collections/fr_blue/leader_initiated_recovery/shard1/dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
2016-10-04 18:43:39,442 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:publishDownState:123  - Put replica core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue on dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr into leader-initiated recovery.
2016-10-04 18:43:39,442 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] WARN  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:publishDownState:158  - Leader is publishing core=fr_blue coreNodeName =dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue state=down on behalf of un-reachable replica http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
2016-10-04 18:43:39,448 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:190  - LeaderInitiatedRecoveryThread-fr_blue started running to send REQUESTRECOVERY command to http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/; will try for a max of 600 secs
2016-10-04 18:43:39,449 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:202  - Asking core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue on http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr to recover
2016-10-04 18:43:39,450 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954] INFO  org.apache.solr.cloud.Overseer:run:213  - processMessage: queueSize: 1, message = {
  "operation":"state",
  "state":"down",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>,
  "core":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "shard":"shard1",
  "collection":"fr_blue"} current state version: 30797
2016-10-04 18:43:39,453 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954] INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:211  - Update state numShards=null message={
  "operation":"state",
  "state":"down",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>,
  "core":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "shard":"shard1",
  "collection":"fr_blue"}
2016-10-04 18:43:39,453 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954] INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:239  - node=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue is already registered
2016-10-04 18:43:39,457 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:211  - Successfully sent REQUESTRECOVERY command to core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue on http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr
2016-10-04 18:43:39,460 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue// x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:run:111  - LeaderInitiatedRecoveryThread-fr_blue completed successfully after running for 29.0ms
2016-10-04 18:43:39,466 [zkCallback-4-thread-118-processing-n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr] INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
2016-10-04 18:43:39,516 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954] INFO  org.apache.solr.cloud.Overseer:run:213  - processMessage: queueSize: 1, message = {
  "operation":"state",
  "core_node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue",
  "numShards":"1",
  "shard":"shard1",
  "roles":null,
  "state":"recovering",
  "core":"fr_blue",
  "collection":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>} current state version: 30798
2016-10-04 18:43:39,518 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954] INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:211  - Update state numShards=1 message={
  "operation":"state",
  "core_node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue",
  "numShards":"1",
  "shard":"shard1",
  "roles":null,
  "state":"recovering",
  "core":"fr_blue",
  "collection":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>}


2. Log of the replica (solr-57) which is put in recovery
2016-10-04 18:40:57,014 [commitScheduler-95-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99  - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20161002014159446,segFN=segments_7mb,generation=9875}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20161002014159446,segFN=segments_7mc,generation=9876}
2016-10-04 18:40:57,014 [commitScheduler-95-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166  - newest commit generation = 9876
2016-10-04 18:40:57,293 [commitScheduler-95-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237  - Opening Searcher@6d09e525[fr_blue] realtime
2016-10-04 18:43:39,453 [http-8080-92] INFO  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:812  - It has been requested that we recover: core=fr_blue
2016-10-04 18:43:39,453 [Thread-30289] INFO  org.apache.solr.cloud.ActionThrottle:minimumWaitBetweenActions:72  - The last recovery attempt started 234111418ms ago.
2016-10-04 18:43:39,456 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.RecoveryStrategy:run:223  - Starting recovery process. recoveringAfterStartup=false
2016-10-04 18:43:39,456 [http-8080-92] INFO  org.apache.solr.servlet.HttpSolrCall:handleAdminRequest:667  - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_blue&wt=javabin&version=2} status=0 QTime=3
2016-10-04 18:43:39,457 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.RecoveryStrategy:doRecovery:342  - Begin buffering updates. core=fr_blue
2016-10-04 18:43:39,466 [zkCallback-4-thread-84-processing-n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr] INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.RecoveryStrategy:doRecovery:346  - Publishing state of core fr_blue as recovering, leader is http://dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/ and I am http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.ZkController:publish:1199  - publishing state=recovering
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.ZkController:publish:1203  - numShards not found on descriptor - reading it from system property
2016-10-04 18:43:39,508 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.ZkController:updateLeaderInitiatedRecoveryState:2194  - Wrote recovering to /collections/fr_blue/leader_initiated_recovery/shard1/dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
2016-10-04 18:43:39,513 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] INFO  org.apache.solr.cloud.RecoveryStrategy:sendPrepRecoveryCmd:605  - Sending prep recovery command to http://dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080/searchsolrnodefr; WaitForState: action=PREPRECOVERY&core=fr_blue&nodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr&coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
2016-10-04 18:43:39,627 [zkCallback-4-thread-84-processing-n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr] INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)

Thanks,

Ludo



-------- Forwarded Message --------
Subject:        Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?
Date:   Thu, 6 Oct 2016 16:00:01 +0200
From:   Erick Erickson <er...@gmail.com>
Reply-To:       solr-user@lucene.apache.org
To:     solr-user <so...@lucene.apache.org>


There is no information here at all that would allow us to say anything meaningful. You might review:
http://wiki.apache.org/solr/UsingMailingLists

What do the logs say? Are there any exceptions? What happens on your system that's unusual, if anything? In short, what have you tried in order to diagnose the cause, and what have you learned?

But here's a random guess: you didn't configure your log4j properties, and your logs, particularly your console log, have grown to huge sizes, so occasionally you encounter disk-full issues. Or your ZK nodes have the same problem: they accumulate snapshots (see the ZooKeeper admin page).

Best,
Erick


Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

Posted by Erick Erickson <er...@gmail.com>.
There is no information here at all that would allow us to say anything meaningful. You might review:
http://wiki.apache.org/solr/UsingMailingLists

What do the logs say? Are there any exceptions? What happens on your system that's unusual, if anything? In short, what have you tried in order to diagnose the cause, and what have you learned?

But here's a random guess: you didn't configure your log4j properties, and your logs, particularly your console log, have grown to huge sizes, so occasionally you encounter disk-full issues. Or your ZK nodes have the same problem: they accumulate snapshots (see the ZooKeeper admin page).
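
If that's the case, a log4j.properties along these lines keeps the logs bounded (Solr 5.x ships log4j 1.2; the log path and sizes below are only examples, adjust them to your setup):

    # Rolling file appender so logs cannot grow without bound (illustrative values)
    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    log4j.appender.file.File=/var/solr/logs/solr.log
    log4j.appender.file.MaxFileSize=100MB
    log4j.appender.file.MaxBackupIndex=9
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n

On the ZooKeeper side, the autopurge.snapRetainCount and autopurge.purgeInterval settings in zoo.cfg (covered on that admin page) keep old snapshots from piling up.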

Best,
Erick

Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

Posted by Gerald Reinhart <ge...@kelkoo.com>.
Hi Pushkar Raste,

   Thanks for your hints.
   We will try the third suggestion and keep you posted.

Gérald Reinhart

On 10/07/2016 02:23 AM, Pushkar Raste wrote:
A couple of questions/suggestions:
- This normally happens after a leader election: when a new leader gets elected, it forces all the nodes to sync with it.
Check the logs to see whether a leader change took place when this happens. If so, you will have to investigate why the leader change happens.
I suspect the leader goes into a GC pause long enough that ZooKeeper considers the leader no longer available and initiates a leader election.

- What version of Solr are you using? SOLR-8586 <https://issues.apache.org/jira/browse/SOLR-8586> introduced the IndexFingerprint check; unfortunately it was broken, so a replica would always do a full index replication. The issue is now fixed in SOLR-9310 <https://issues.apache.org/jira/browse/SOLR-9310>, which should help replicas recover faster.

- You should also increase the ulog size (the default threshold is 100 docs or 10 tlogs, whichever is hit first). This will again help replicas recover faster from their tlogs (of course, there is a threshold beyond which recovering from the tlog would in fact take longer than copying over all the index files from the leader).



Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?

Posted by Pushkar Raste <pu...@gmail.com>.
A couple of questions/suggestions:

- This normally happens after a leader election: when a new leader gets elected, it forces all the nodes to sync with it. Check the logs to see whether a leader change took place when this happens. If so, you will have to investigate why the leader change happens. I suspect the leader goes into a GC pause long enough that ZooKeeper considers the leader no longer available and initiates a leader election (GC logging, sketched below, can confirm this).

- What version of Solr are you using? SOLR-8586 <https://issues.apache.org/jira/browse/SOLR-8586> introduced the IndexFingerprint check; unfortunately it was broken, so a replica would always do a full index replication. The issue is now fixed in SOLR-9310 <https://issues.apache.org/jira/browse/SOLR-9310>, which should help replicas recover faster.

- You should also increase the ulog size (the default threshold is 100 docs or 10 tlogs, whichever is hit first). This will again help replicas recover faster from their tlogs (of course, there is a threshold beyond which recovering from the tlog would in fact take longer than copying over all the index files from the leader); see the config sketch below.
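
For the GC check, these HotSpot flags (Java 7/8 era; the log path is just an example) make long stop-the-world pauses visible, so you can compare them against your zkClientTimeout:

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log

For the ulog suggestion, the knobs are in the <updateLog> section of solrconfig.xml; numRecordsToKeep and maxNumLogsToKeep control how much is retained. The values below are only illustrative and should be sized to your update rate:

    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <!-- keep more update records so a recovering replica can PeerSync
           from the tlog instead of copying the whole index (example values) -->
      <int name="numRecordsToKeep">10000</int>
      <int name="maxNumLogsToKeep">100</int>
    </updateLog>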

