Posted to solr-user@lucene.apache.org by Henrik Ossipoff Hansen <ho...@entertainment-trading.com> on 2013/11/05 09:33:13 UTC

SolrCloud never fully recovers after slow disks

I previously posted about this, but have since narrowed down the issue, so I'm giving it another try with a new angle.

We are running a 4-node setup (on Tomcat 7) with an external 3-node ZooKeeper ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our storage system (an NFS share in VMware).

Now, I do realize and have heard that NFS is not the greatest storage to run Solr on, but we have never had this issue with non-SolrCloud setups.

Basically, each night when we run our backup jobs, our storage becomes slow to respond. This is obviously something we're trying to solve, but the bottom line is that all our other systems either stay alive or recover gracefully once bandwidth is available again.
SolrCloud - not so much. Typically after a session like this, 3-5 nodes will go into either a Down state or a Recovering state - and stay that way. Sometimes such a node will even be marked as leader. Such a node will have something like this in the log:

ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

On the other nodes, an error similar to this will be in the log:

09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...

Does anyone have any ideas or leads towards a solution - one that doesn't involve getting a new storage system? (That *is* something we are actively working on, but it's not a quick fix in our case.) Shouldn't a setup like this be possible? And even more so - shouldn't SolrCloud be able to recover gracefully after issues like this?

--
Henrik Ossipoff Hansen
Developer, Entertainment Trading

RE: SolrCloud never fully recovers after slow disks

Posted by Henrik Ossipoff Hansen <ho...@entertainment-trading.com>.
Solr version is 4.5.0.

I have done some tweaking. Doubling my ZooKeeper timeout values in zoo.cfg and the ZooKeeper client timeout in solr.xml seemed to reduce the problem somewhat, but it still occurred. I then stopped all larger batch indexing in the period when the issues happen, which also seemed to help. The next thing weirds me out a bit - I switched from Tomcat 7 to the Jetty that ships with Solr, and that (together with stopping a few smaller updates - very few) actually seems to have fixed the remaining issues.
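
For reference, the solr.xml setting I mean is the ZooKeeper client timeout in the <solrcloud> section - roughly the sketch below (illustrative values, not our exact production file; the shipped default differs between versions):

    <!-- solr.xml ("new style") - illustrative sketch only -->
    <solr>
      <solrcloud>
        <str name="host">${host:}</str>
        <str name="hostContext">${hostContext:solr}</str>
        <!-- ZooKeeper session timeout in ms; this is the value I roughly doubled -->
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      </solrcloud>
    </solr>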

During the "slow period" in the night, I get something like this:

03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
03:06:47 ERROR Overseer Could not create Overseer node
03:06:47 WARN LeaderElector
03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2 core=auto_suggest_shard1_replica2

After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.

Bottom line: the issues seem fixed for now, but I still find it weird that Solr was not able to fully recover.

// Henrik Ossipoff

-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com] 
Sent: 10. november 2013 19:27
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud never fully recovers after slow disks

Which version of solr are you using? Regardless of your env, this is a fail safe that you should not hit. 

- Mark

> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
> 
> I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
> 
> We are running a 4 node setup (over Tomcat7) with a 3-ensemble external ZooKeeper. This is running no a total of 7 (4+3) different VMs, and each VM is using our Storage system (NFS share in VMWare).
> 
> Now I do realize and have heard, that NFS is not the greatest way to run Solr on, but we have never had this issue on non-SolrCloud setups.
> 
> Basically, each night when we run our backup jobs, our storage becomes a bit slow in response - this is obviously something we’re trying to solve, but bottom line is, that all our other systems somehow stays alive or recovers gracefully when bandwidth exists again.
> SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such node will even be marked as leader. A such node will have something like this in the log:
> 
> ERROR - 2013-11-05 08:57:45.764; 
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:724)
> 
> On the other nodes, an error similar to this will be in the log:
> 
> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: 
> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apa
> che.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server 
> at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 
> returned non ok status:503,​ message:Service Unavailable
> 09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
> 
> Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case). Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?
> 
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading

Re: SolrCloud never fully recovers after slow disks

Posted by Yago Riveiro <ya...@gmail.com>.
Hi,   

I sometimes get this exception too; the recovery goes into a loop, and I can only finish it by restarting the replica that has the stuck core.

In my case I have SSDs, but the replicas are 40 or 50 GB each. If I have 3 replicas in recovery mode and they are all replicating from the same node, I get this error.

My indexing rate is high too (~500 docs/s).

--  
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, November 11, 2013 at 8:27 AM, Henrik Ossipoff Hansen wrote:

> The joy was short-lived.
>  
> Tonight our environment was “down/slow” a bit longer than usual. It looks like two of our nodes never recovered, clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the ones that are affected) - the error comes about several cores:
>  
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
> at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
> at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
> at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
> at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
> at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
> at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
> at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
> at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
> at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
> ... 4 more
>  
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>  
>  
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (hoha@entertainment-trading.com<ma...@entertainment-trading.com>) wrote:
>  
> Solr version is 4.5.0.
>  
> I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still did occur. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from using Tomcat7 to using the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).
>  
> During the "slow period" in the night, I get something like this:
>  
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered,​ but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080 (http://solr04.cd-et.com:8080)_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>  
> After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.
>  
> Bottom line is that the issues are fixed for now it seems, but I still find it weird that Solr was not able to fully receover.
>  
> // Henrik Ossipoff
>  
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org (mailto:solr-user@lucene.apache.org)
> Subject: Re: SolrCloud never fully recovers after slow disks
>  
> Which version of solr are you using? Regardless of your env, this is a fail safe that you should not hit.
>  
> - Mark
>  
> > On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <hoha@entertainment-trading.com (mailto:hoha@entertainment-trading.com)> wrote:
> >  
> > I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
> >  
> > We are running a 4 node setup (over Tomcat7) with a 3-ensemble external ZooKeeper. This is running no a total of 7 (4+3) different VMs, and each VM is using our Storage system (NFS share in VMWare).
> >  
> > Now I do realize and have heard, that NFS is not the greatest way to run Solr on, but we have never had this issue on non-SolrCloud setups.
> >  
> > Basically, each night when we run our backup jobs, our storage becomes a bit slow in response - this is obviously something we’re trying to solve, but bottom line is, that all our other systems somehow stays alive or recovers gracefully when bandwidth exists again.
> > SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such node will even be marked as leader. A such node will have something like this in the log:
> >  
> > ERROR - 2013-11-05 08:57:45.764;
> > org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
> > at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
> > at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
> > at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
> > at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
> > at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> > at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> > at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> > at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
> > at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
> > at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
> > at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
> > at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
> > at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
> > at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
> > at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
> > at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:724)
> >  
> > On the other nodes, an error similar to this will be in the log:
> >  
> > 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode:
> > http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apa
> > che.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server
> > at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2
> > returned non ok status:503,​ message:Service Unavailable
> > 09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
> >  
> > Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case). Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?
> >  
> > --
> > Henrik Ossipoff Hansen
> > Developer, Entertainment Trading
> >  
>  
>  
>  



Re: SolrCloud never fully recovers after slow disks

Posted by Mark Miller <ma...@gmail.com>.
The socket read timeouts are actually fairly short for recovery - we should probably bump them up. Can you file a JIRA issue? It may be a symptom rather than a cause, but given a slow env, bumping them up makes sense. 
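
(For reference, the inter-node HTTP timeouts that are configurable today live in solr.xml's shard handler factory - a rough sketch with illustrative values is below; note these settings may not cover the hard-coded prep-recovery read timeout in question.)

    <!-- solr.xml - illustrative sketch of the configurable inter-node HTTP timeouts -->
    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <int name="socketTimeout">${socketTimeout:600000}</int> <!-- SO_TIMEOUT (read) in ms -->
      <int name="connTimeout">${connTimeout:60000}</int>      <!-- connect timeout in ms -->
    </shardHandlerFactory>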

- Mark

> On Nov 11, 2013, at 8:27 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
> 
> The joy was short-lived.
> 
> Tonight our environment was “down/slow” a bit longer than usual. It looks like two of our nodes never recovered, clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the ones that are affected) - the error comes about several cores:
> 
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
>        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
>        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.read(SocketInputStream.java:150)
>        at java.net.SocketInputStream.read(SocketInputStream.java:121)
>        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
>        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
>        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
>        ... 4 more
> 
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
> 
> 
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (hoha@entertainment-trading.com<ma...@entertainment-trading.com>) wrote:
> 
> Solr version is 4.5.0.
> 
> I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still did occur. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from using Tomcat7 to using the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).
> 
> During the "slow period" in the night, I get something like this:
> 
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered,​ but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
> 
> After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.
> 
> Bottom line is that the issues are fixed for now it seems, but I still find it weird that Solr was not able to fully receover.
> 
> // Henrik Ossipoff
> 
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
> 
> Which version of solr are you using? Regardless of your env, this is a fail safe that you should not hit.
> 
> - Mark
> 
>> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
>> 
>> I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
>> 
>> We are running a 4 node setup (over Tomcat7) with a 3-ensemble external ZooKeeper. This is running no a total of 7 (4+3) different VMs, and each VM is using our Storage system (NFS share in VMWare).
>> 
>> Now I do realize and have heard, that NFS is not the greatest way to run Solr on, but we have never had this issue on non-SolrCloud setups.
>> 
>> Basically, each night when we run our backup jobs, our storage becomes a bit slow in response - this is obviously something we’re trying to solve, but bottom line is, that all our other systems somehow stays alive or recovers gracefully when bandwidth exists again.
>> SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such node will even be marked as leader. A such node will have something like this in the log:
>> 
>> ERROR - 2013-11-05 08:57:45.764;
>> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>> at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>> at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>> at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:724)
>> 
>> On the other nodes, an error similar to this will be in the log:
>> 
>> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode:
>> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apa
>> che.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server
>> at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2
>> returned non ok status:503,​ message:Service Unavailable
>> 09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>> 
>> Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case). Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?
>> 
>> --
>> Henrik Ossipoff Hansen
>> Developer, Entertainment Trading

Re: SolrCloud never fully recovers after slow disks

Posted by Henrik Ossipoff Hansen <ho...@entertainment-trading.com>.
I will file a JIRA later today.

What I don't get, though (I haven't looked much into the actual Solr code), is that at this point our systems are running fine, so timeouts shouldn't be an issue. Those two nodes, however, are somehow left in a state where their response time is up to around 120k ms - which is fairly high - while everything else is running normally at this point.
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 11. nov. 2013 at 16.01.58, Mark Miller (markrmiller@gmail.com<ma...@gmail.com>) wrote:

The socket read timeouts are actually fairly short for recovery - we should probably bump them up. Can you file a JIRA issue? It may be a symptom rather than a cause, but given a slow env, bumping them up makes sense.

- Mark

> On Nov 11, 2013, at 8:27 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
>
> The joy was short-lived.
>
> Tonight our environment was “down/slow” a bit longer than usual. It looks like two of our nodes never recovered, clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the ones that are affected) - the error comes about several cores:
>
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:150)
> at java.net.SocketInputStream.read(SocketInputStream.java:121)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
> at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
> at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
> at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
> at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
> at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
> at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
> at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
> at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
> at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
> at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
> at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
> ... 4 more
>
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
>
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (hoha@entertainment-trading.com<ma...@entertainment-trading.com>) wrote:
>
> Solr version is 4.5.0.
>
> I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still did occur. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from using Tomcat7 to using the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).
>
> During the "slow period" in the night, I get something like this:
>
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered,​ but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>
> After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.
>
> Bottom line is that the issues are fixed for now it seems, but I still find it weird that Solr was not able to fully receover.
>
> // Henrik Ossipoff
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
>
> Which version of solr are you using? Regardless of your env, this is a fail safe that you should not hit.
>
> - Mark
>
>> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
>>
>> I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
>>
>> We are running a 4 node setup (over Tomcat7) with a 3-ensemble external ZooKeeper. This is running no a total of 7 (4+3) different VMs, and each VM is using our Storage system (NFS share in VMWare).
>>
>> Now I do realize and have heard, that NFS is not the greatest way to run Solr on, but we have never had this issue on non-SolrCloud setups.
>>
>> Basically, each night when we run our backup jobs, our storage becomes a bit slow in response - this is obviously something we’re trying to solve, but bottom line is, that all our other systems somehow stays alive or recovers gracefully when bandwidth exists again.
>> SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such node will even be marked as leader. A such node will have something like this in the log:
>>
>> ERROR - 2013-11-05 08:57:45.764;
>> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>> at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>> at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>> at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:724)
>>
>> On the other nodes, an error similar to this will be in the log:
>>
>> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode:
>> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apa
>> che.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server
>> at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2
>> returned non ok status:503,​ message:Service Unavailable
>> 09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>>
>> Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case). Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?
>>
>> --
>> Henrik Ossipoff Hansen
>> Developer, Entertainment Trading

RE: SolrCloud never fully recovers after slow disks

Posted by Henrik Ossipoff Hansen <ho...@entertainment-trading.com>.
The joy was short-lived.

Tonight our environment was “down/slow” a bit longer than usual. It looks like two of our nodes never recovered, even though clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the affected ones) - the error occurs for several cores:

ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
        ... 4 more

ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (hoha@entertainment-trading.com<ma...@entertainment-trading.com>) wrote:

Solr version is 4.5.0.

I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still did occur. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from using Tomcat7 to using the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).

During the "slow period" in the night, I get something like this:

03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
03:06:47 ERROR Overseer Could not create Overseer node
03:06:47 WARN LeaderElector
03:06:47 WARN ZkStateReader ZooKeeper watch triggered,​ but Solr cannot talk to ZK
03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2

After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.

Bottom line is that the issues are fixed for now it seems, but I still find it weird that Solr was not able to fully receover.

// Henrik Ossipoff

-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com]
Sent: 10. november 2013 19:27
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud never fully recovers after slow disks

Which version of solr are you using? Regardless of your env, this is a fail safe that you should not hit.

- Mark

> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <ho...@entertainment-trading.com> wrote:
>
> I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
>
> We are running a 4 node setup (over Tomcat7) with a 3-ensemble external ZooKeeper. This is running no a total of 7 (4+3) different VMs, and each VM is using our Storage system (NFS share in VMWare).
>
> Now I do realize and have heard, that NFS is not the greatest way to run Solr on, but we have never had this issue on non-SolrCloud setups.
>
> Basically, each night when we run our backup jobs, our storage becomes a bit slow in response - this is obviously something we’re trying to solve, but bottom line is, that all our other systems somehow stays alive or recovers gracefully when bandwidth exists again.
> SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such node will even be marked as leader. A such node will have something like this in the log:
>
> ERROR - 2013-11-05 08:57:45.764;
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
> at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
> at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
> at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
>
> On the other nodes, an error similar to this will be in the log:
>
> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode:
> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apa
> che.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server
> at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2
> returned non ok status:503,​ message:Service Unavailable
> 09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>
> Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case). Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?
>
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading

Re: SolrCloud never fully recovers after slow disks

Posted by Mark Miller <ma...@gmail.com>.
Which version of Solr are you using? Regardless of your env, this is a fail-safe that you should not hit.

- Mark

Re: SolrCloud never fully recovers after slow disks

Posted by Erick Erickson <er...@gmail.com>.
Right, can you up your ZK timeouts significantly? It sounds like
your ZK timeout is short enough that when your system slows
down, the timeout is exceeded and it's throwing Solr
into a tailspin....

See zoo.cfg.
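
As a rough sketch (illustrative values only, not a recommendation), the ZooKeeper-side knobs live in zoo.cfg. Note that the server clamps whatever session timeout a client asks for to the range [2 * tickTime, maxSessionTimeout], and maxSessionTimeout defaults to 20 * tickTime:

    # zoo.cfg -- relevant lines only, illustrative values
    # tickTime is the base time unit, in milliseconds
    tickTime=2000
    initLimit=10
    syncLimit=5
    # The default ceiling for client sessions is 20 * tickTime (40 s with
    # tickTime=2000); raise it if Solr should negotiate a longer session.
    maxSessionTimeout=60000

So raising only the timeout Solr asks for may have no visible effect unless this ceiling is raised as well.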

Best,
Erick


Re: SolrCloud never fully recovers after slow disks

Posted by Henrik Ossipoff Hansen <ho...@entertainment-trading.com>.
Hey Erick,

I have tried upping the timeouts quite a bit now, and have tried upping the zkTimeout setting in Solr itself (I found a few old posts on the mailing list suggesting this).
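
For reference, a sketch of the kind of change involved, assuming the setting in question is zkClientTimeout in a stock Solr 4.x style solr.xml (values illustrative):

    <!-- solr.xml (new-style); other <solrcloud> entries omitted -->
    <solrcloud>
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    </solrcloud>

    # ...or leave solr.xml at its default and override the property from
    # Tomcat instead, e.g. in bin/setenv.sh:
    JAVA_OPTS="$JAVA_OPTS -DzkClientTimeout=30000"

Whatever value Solr ends up requesting is still negotiated down to ZooKeeper's maxSessionTimeout if it exceeds that ceiling.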

I realise this is a sort of weird situation, where we are actually trying to work around some horrible hardware setup.

Thank you for your reply - I will follow up in a day or two once I see how it performs.
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 7. nov. 2013 at 13.23.59, Erick Erickson (erickerickson@gmail.com) wrote:

Right, can you up your ZK timeouts significantly? It sounds like
your ZK timeout is short enough that when your system slows
down, the timeout is exceeded and it's throwing Solr
into a tailspin....

See zoo.cfg.

Best,
Erick

