Posted to solr-user@lucene.apache.org by Sudip Mukherjee <sm...@commvault.com> on 2018/09/05 08:55:04 UTC

Replicas do not come up after nodes are restarted in SOLR cloud

Hi,

I have a 2-node Solr (7.x) cloud cluster with a collection that has replicas (replicationFactor = 2, 1 shard). The replicas do not come up (state is "down") when both nodes are restarted. According to the legend in the Graph section, the replicas are in the "recovery failed" state.

The following errors are seen:

2018-09-05 14:07:40.157 ERROR (qtp1347137144-10094) [   ] org.apache.solr.servlet.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://localhost/solr/metadata/select
                at org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:646)
                at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
....
Caused by: java.net.SocketTimeoutException: Read timed out
                at java.net.SocketInputStream.socketRead0(Native Method)
                at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

There are other non-replicated collections; they appear as "active" and can be queried from the Solr UI.


Is this expected when all nodes are restarted? How can we bring the replicas back online from the "recovery failed" state?
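
For reference, the replica states shown in the Graph view can also be read programmatically through the Collections API CLUSTERSTATUS action. The following is a minimal sketch, not a fix for the problem: the function names are hypothetical, the base URL passed to fetch_cluster_status is a placeholder, and the code assumes the usual JSON layout of the response (cluster, then collections, shards, and replicas, each replica carrying "state" and "node_name").

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as JSON. base_url is a placeholder such as
    http://solr-host:8983/solr (an assumption, not taken from the thread)."""
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def non_active_replicas(status):
    """Return (collection, shard, replica, node_name, state) for each replica
    whose reported state is anything other than 'active'."""
    problems = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    problems.append(
                        (coll_name, shard_name, replica_name,
                         replica.get("node_name"), state)
                    )
    return problems
```

Calling non_active_replicas(fetch_cluster_status("http://solr-host:8983/solr")) against a live node would list the replicas that are down, recovering, or in recovery_failed, together with the node each one is assigned to.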


Thanks,
Sudip Mukherjee



***************************Legal Disclaimer***************************
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or distribution
by others is strictly prohibited. If you have received the message by mistake,
please advise the sender by reply email and delete the message. Thank you."
**********************************************************************

RE: Replicas do not come up after nodes are restarted in SOLR cloud

Posted by Sudip Mukherjee <sm...@commvault.com>.
Hi Shawn,

Thanks for the insights. 

I verified the points you mentioned.
The socket timeout defaults weren't changed.

Both nodes (two different hosts, Windows OS) are given 4 GB of heap space. They currently host two collections:
One has no replicas, with 8 shards on each node.
One has replicas and 1 shard (both replicas are down).

I monitored the application for some time and I do not see obvious memory issues or prolonged garbage collection.

Also, the nodes are registered with their hostnames in the collection.


Thanks,
Sudip

-----Original Message-----
From: Shawn Heisey [mailto:elyograg@elyograg.org] 
Sent: Wednesday, September 05, 2018 7:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Replicas do not come up after nodes are restarted in SOLR cloud

On 9/5/2018 2:55 AM, Sudip Mukherjee wrote:
> I have a 2 node SOLR (7.x) cloud cluster on which I have collection 
> with replicas ( replicationFactor = 2, shard = 1 ). I am seeing that the replicas do not come up ( state is "down")  when both nodes are restarted. From the "legend" in Graph section, I see that the replicas are in "recovery failed" state.
<snip>
> Caused by: java.net.SocketTimeoutException: Read timed out
>                  at java.net.SocketInputStream.socketRead0(Native Method)
>                  at 
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

Have you changed the socket timeout in Solr's config?

The socket timeout for internode requests defaults to 60 seconds.  If something happened that prevented a Solr server from responding within
60 seconds, then there's something *REALLY* wrong.

My best guess is that your Solr heap is too small, causing Java to spend almost all of its time doing garbage collection.  Or that a too-small heap has caused one of your servers to experience an OutOfMemoryError, which on non-Windows systems will result in the Solr process being killed.

Some questions in case that's not it:

How many collections do you have on this setup?

In the admin UI (Cloud tab), what hostname do your nodes show they are registered as?  If it's localhost, that's going to be a problem for a 2-node cluster.

Thanks,
Shawn

Re: Replicas do not come up after nodes are restarted in SOLR cloud

Posted by Shawn Heisey <el...@elyograg.org>.
On 9/5/2018 2:55 AM, Sudip Mukherjee wrote:
> I have a 2 node SOLR (7.x) cloud cluster on which I have collection with replicas ( replicationFactor = 2, shard = 1 ). I am seeing that the replicas do not come up ( state is "down")  when both nodes are restarted. From the "legend" in Graph section, I see that the replicas are in
> "recovery failed" state.
<snip>
> Caused by: java.net.SocketTimeoutException: Read timed out
>                  at java.net.SocketInputStream.socketRead0(Native Method)
>                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

Have you changed the socket timeout in Solr's config?

The socket timeout for internode requests defaults to 60 seconds.  If 
something happened that prevented a Solr server from responding within 
60 seconds, then there's something *REALLY* wrong.

My best guess is that your Solr heap is too small, causing Java to spend 
almost all of its time doing garbage collection.  Or that a too-small 
heap has caused one of your servers to experience an OutOfMemoryError, 
which on non-Windows systems will result in the Solr process being killed.

Some questions in case that's not it:

How many collections do you have on this setup?

In the admin UI (Cloud tab), what hostname do your nodes show they are 
registered as?  If it's localhost, that's going to be a problem for a 
2-node cluster.
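
The hostname check described above can be scripted against the "live_nodes" list that the Collections API CLUSTERSTATUS action returns. This is a minimal sketch under stated assumptions: the function name is hypothetical, node names are assumed to have the usual "host:port_context" form (for example "host1:8983_solr"), and only "localhost" and 127.x addresses are treated as loopback.

```python
def loopback_nodes(live_nodes):
    """Return the node names whose host part is 'localhost' or a 127.x address.

    live_nodes is the list found under cluster -> live_nodes in a
    CLUSTERSTATUS response; each entry is assumed to look like
    'host:port_context', e.g. 'host1:8983_solr'.
    """
    flagged = []
    for node in live_nodes:
        # Everything before the last ':' is taken as the host part.
        host = node.rsplit(":", 1)[0].lower()
        if host == "localhost" or host.startswith("127."):
            flagged.append(node)
    return flagged
```

An empty result suggests the nodes registered under names that other hosts can resolve; any entry returned would be worth checking, since a second node cannot reach the first through a loopback name.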

Thanks,
Shawn